Representing and mining text
Data preparation
Must engineer the data to match the tools
Or build new tools to match data
Text data
Why it is important
It is everywhere
Communication between humans not computers
So it has to be converted
Internet filled with text data
Why it is difficult
"Unstructured data"
Linguistic structure for humans not computers
Varying lengths
Sometimes order matters sometimes not
"Dirty"
Grammar not correct, words misspelled
Contains synonyms and homographs
Context is important
Text must go through a good amount of preprocessing
Bag of words
Takes a set of documents and puts it into a familiar feature-vector form
This process treats each document as a collection of individual words
Ignores grammar, word order and sentence structure (usually punctuation as well)
Treats each word as a potential keyword of the document
Inexpensive, usually works, straightforward
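The bag-of-words idea above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `bag_of_words` and the sample documents are made up for the example, and tokens are split on whitespace only (no case folding, stemming, or stop-word removal).

```python
from collections import Counter

def bag_of_words(documents):
    """Turn documents into count vectors over a shared vocabulary.

    Hypothetical helper for illustration: word order and grammar are
    discarded, and every distinct token becomes one feature.
    """
    # Whitespace tokenization only; punctuation handling is left out.
    tokenized = [doc.split() for doc in documents]
    # The vocabulary is every distinct token in the corpus, sorted so
    # that feature positions are stable.
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors

docs = ["the cat sat", "the dog sat on the cat"]
vocabulary, vectors = bag_of_words(docs)
print(vocabulary)  # ['cat', 'dog', 'on', 'sat', 'the']
print(vectors)     # [[1, 0, 0, 1, 1], [1, 1, 1, 1, 2]]
```

Each document becomes a row with one column per vocabulary term, which is the familiar feature-vector form.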
Text lingo
(IR) Information retrieval
Document = one single piece of text
Could be a page long or 100 pages long
Tokens/terms
What the document is composed of (words)
Corpus= Collection of documents
Frequency= word count
How many times a word is used
To make a table based on word frequency
Normalize case: every term is lowercased
Stemmed: suffixes removed
Stop words removed, such as "and", "on", "of"
Not always a good idea
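The preparation steps for a word-frequency table (lowercasing, stemming, stop-word removal, counting) might look like the sketch below. Everything here is an assumption for illustration: the stop-word set is a tiny hand-picked list, and `crude_stem` is a toy suffix stripper standing in for a real stemmer such as Porter's.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"and", "on", "of", "the", "a", "is"}
# Toy suffix list; this is NOT a real stemming algorithm.
SUFFIXES = ("ing", "ed", "s")

def crude_stem(token):
    """Strip the first matching suffix, keeping at least 3 characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def term_frequencies(text, remove_stop_words=True):
    """Count terms after case normalization, stop-word removal, stemming."""
    # Lowercase first, then keep runs of letters as tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return Counter(crude_stem(t) for t in tokens)

freqs = term_frequencies("Jumping dogs jumped on the Dog and the dogs")
print(freqs["jump"], freqs["dog"])  # 2 3
```

The `remove_stop_words` flag is there because dropping stop words is not always a good idea; for some tasks those words carry meaning.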
Measuring sparseness
Inverse document frequency
Term should not be too rare
For clustering no use in keeping a word that only occurs once
For retrieval a term may be important since user may be looking for exact word
Term should not be too common
Occurring in every document is not useful for classification
Consider distribution of corpus
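A rough sketch of the ideas above, under stated assumptions. The notes do not give a formula, so the code uses one common formulation, IDF(t) = 1 + log(N / n_t), where N is the number of documents and n_t is the number of documents containing term t. The thresholds `min_docs` and `max_fraction` are hypothetical parameters chosen for the example.

```python
import math

def document_frequency(tokenized_docs):
    """Number of documents each term appears in (not total occurrences)."""
    df = {}
    for tokens in tokenized_docs:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    return df

def inverse_document_frequency(tokenized_docs):
    """IDF(t) = 1 + log(N / n_t); one common variant, assumed here."""
    n_docs = len(tokenized_docs)
    df = document_frequency(tokenized_docs)
    return {term: 1 + math.log(n_docs / count) for term, count in df.items()}

def filter_terms(tokenized_docs, min_docs=2, max_fraction=0.9):
    """Keep terms that are neither too rare nor too common."""
    n_docs = len(tokenized_docs)
    df = document_frequency(tokenized_docs)
    return {
        term
        for term, count in df.items()
        if count >= min_docs and count / n_docs <= max_fraction
    }

corpus = [
    ["jazz", "music", "the"],
    ["rock", "music", "the"],
    ["the", "news"],
]
idf = inverse_document_frequency(corpus)
kept = filter_terms(corpus)
print(idf["the"])  # 1.0, because "the" occurs in every document
print(kept)        # {'music'}
```

Rare terms get a high IDF and common terms a low one. Whether to drop rare terms depends on the task: the filter suits clustering or classification, while for retrieval a lower `min_docs` may be preferable.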
N-gram sequences
When word order is important
Want to preserve some information about its sequence
Useful when phrases are significant but their component words are not
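As a small illustration of n-gram sequences, the sketch below builds features from adjacent word sequences. The helper names `ngrams` and `words_and_ngrams` and the underscore-joined feature naming are assumptions made for the example.

```python
def ngrams(tokens, n):
    """Return all runs of n adjacent tokens, joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def words_and_ngrams(tokens, max_n=2):
    """Single words plus every n-gram up to max_n, as one feature list."""
    features = []
    for n in range(1, max_n + 1):
        features.extend(ngrams(tokens, n))
    return features

tokens = "the quick brown fox".split()
bigrams = ngrams(tokens, 2)
print(bigrams)  # ['the_quick', 'quick_brown', 'brown_fox']
features = words_and_ngrams(tokens)
print(len(features))  # 7: four single words plus three bigrams
```

Adding n-grams preserves some local word order, at the cost of a much larger feature set.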
Named entity extraction
Recognize common named entities in documents
Preprocessing component
Topic models
Adds an additional layer between the document and the model
Models the set of topics in the corpus separately