Representing and Mining text
Text data needs to be converted into a meaningful form
Unstructured data
Linguistic structure is hard for computers to process
TFIDF
combine term frequency and inverse document frequency
Each document becomes a feature vector
treat every term as a feature, then assign values based on frequency and rarity
The relationship of IDF to entropy
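The TFIDF notes above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `tfidf_vectors`, the toy documents, and the IDF variant `1 + ln(N / n_t)` are assumptions chosen for the example (other weightings are common).

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenized documents into TFIDF feature vectors.

    docs: list of token lists. Returns (vocabulary, vectors), where each
    vector has one value per vocabulary term.
    Assumed IDF variant: 1 + ln(N / number of documents containing the term).
    """
    n_docs = len(docs)
    vocab = sorted({term for doc in docs for term in doc})
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {term: 1 + math.log(n_docs / doc_freq[term]) for term in vocab}

    vectors = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        vectors.append([tf[term] * idf[term] for term in vocab])
    return vocab, vectors

# Toy corpus (made up for illustration).
docs = [["jazz", "music", "jazz"], ["rock", "music"], ["jazz", "rock", "music"]]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
print(vectors[0])
```

In this toy corpus "music" appears in every document, so its IDF is the minimum value of 1, while "jazz" is both frequent in the first document and rarer across the corpus, so it gets the largest weight there.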
Bag of Words
treat every document as a collection of individual words
Ignore grammar, word order, sentence structure, and punctuation
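A minimal sketch of the bag-of-words idea, assuming a very simple tokenizer (lowercase, then keep runs of letters, digits, and apostrophes). The function name `bag_of_words` and the sample sentences are made up for illustration.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Reduce a document to word counts, discarding order and punctuation."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(tokens)

a = bag_of_words("The dog chased the cat.")
b = bag_of_words("The cat chased the dog!")
print(a)
print(a == b)  # the two sentences differ only in word order and punctuation
```

The two sentences mean different things but produce the same bag, which shows what the representation throws away.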
Term Frequency
how many times the word is used
Measuring sparseness
Inverse document frequency
term should not be too rare (appearing in almost no documents)
term should not be too common (appearing in almost every document)
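The two conditions above can be sketched as a document-frequency filter, alongside an IDF score. This is an illustration under stated assumptions: the IDF variant `1 + ln(N / n_t)`, the helper names `idf` and `keep_term`, and the cutoffs `min_docs` and `max_fraction` are all hypothetical choices, not values from the source.

```python
import math

def idf(term, docs):
    """Assumed IDF variant: 1 + ln(N / number of documents containing term).

    docs: list of sets of terms.
    """
    n_containing = sum(1 for doc in docs if term in doc)
    if n_containing == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return 1 + math.log(len(docs) / n_containing)

def keep_term(term, docs, min_docs=2, max_fraction=0.9):
    """Keep a term only if it is neither too rare nor too common.

    min_docs and max_fraction are hypothetical cutoffs for illustration.
    """
    n_containing = sum(1 for doc in docs if term in doc)
    return n_containing >= min_docs and n_containing / len(docs) <= max_fraction

docs = [
    {"the", "cat", "sat"},
    {"the", "dog", "barked"},
    {"the", "cat", "purred"},
    {"the", "fish", "swam"},
]
for term in ("the", "cat", "fish"):
    print(term, round(idf(term, docs), 3), keep_term(term, docs))
```

With these assumed cutoffs, "the" is dropped as too common, "fish" is dropped as too rare, and "cat" is kept; among surviving terms, IDF gives the rarer ones more weight.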
N-gram sequences
word order is important
treat sequences of adjacent words as terms
useful when a particular phrase is important but its individual words are not
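A short sketch of n-gram extraction over an already tokenized document. The helper names `ngrams` and `features_up_to`, the underscore joining, and the sample tokens are assumptions made for the example.

```python
def ngrams(tokens, n):
    """Return every run of n adjacent tokens, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def features_up_to(tokens, max_n):
    """Individual words plus all adjacent sequences up to length max_n."""
    features = []
    for n in range(1, max_n + 1):
        features.extend("_".join(gram) for gram in ngrams(tokens, n))
    return features

tokens = ["exceed", "analyst", "expectation"]
print(ngrams(tokens, 2))
print(features_up_to(tokens, 2))
```

Adding the pairs keeps a phrase such as "exceed_analyst" available as its own feature, at the cost of a much larger vocabulary.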
Named Entity Extraction
recognize common entities in documents
extract phrases we know are meaningful
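One very simple way to "extract phrases we know are meaningful" is a lookup against a fixed list of known names. The sketch below assumes that approach; practical entity extractors are usually trained models or large curated resources, and the list `KNOWN_ENTITIES` and the function `extract_entities` are hypothetical.

```python
import re

# Hypothetical list of known entities, for illustration only.
KNOWN_ENTITIES = ["New York Mets", "New York", "Silicon Valley"]

def extract_entities(text, entities=KNOWN_ENTITIES):
    """Return known entity phrases in the order they appear in text."""
    if not entities:
        return []
    # Try longer names first so "New York Mets" wins over "New York".
    ordered = sorted(entities, key=len, reverse=True)
    pattern = re.compile("|".join(r"\b" + re.escape(e) + r"\b" for e in ordered))
    return [match.group(0) for match in pattern.finditer(text)]

found = extract_entities("The New York Mets flew from New York to Silicon Valley.")
print(found)
```

Each extracted phrase can then be treated as a single term, in the same way as a word in the bag-of-words representation.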
Topic Models
additional layer between document and model
Topic layer
words map to one or more topics