Representing and Mining Text
Text...
text is important because it is communication
almost everything we use online every day uses text
need to understand text to understand business problems and solutions
text is hard because it is unstructured
text's linguistic structure is easy for humans but hard for computers
computers can't understand abbreviations, punctuation, etc.
important to know the context of the text you are working with
text needs a lot of preprocessing before it can be used in an algorithm
Representation
steps to transform text into data
always use the simplest / cheapest method
bag of words approach treats every document as a collection of individual words
Term frequency is the next step
this records how many times each word is used in a document, not just whether it appears
words are usually put into a table alongside their counts
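The bag of words and term count steps above can be sketched as follows. This is a minimal illustration, not the book's code: the `tokenize` and `term_counts` helpers, the lowercasing rule, and the sample sentence are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (assumed, simplistic rule)."""
    return re.findall(r"[a-z']+", text.lower())

def term_counts(text):
    """Bag of words: ignore word order, keep only how often each word occurs."""
    return Counter(tokenize(text))

# Hypothetical sample document.
doc = "Text is hard because text is unstructured"
counts = term_counts(doc)

# Print the word/count table, most frequent first.
for word, n in counts.most_common():
    print(f"{word:<14}{n}")
```

Real preprocessing usually does more (stemming, stopword removal), which this sketch leaves out.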
measuring sparseness (inverse document frequency) decides the weight of each term used in the bag of words
terms that are too rare (e.g. used only once) are typically dropped
overly common terms are also dropped
see book for equation
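As a rough sketch of the sparseness weighting, the code below uses one common form of inverse document frequency, IDF(t) = 1 + log(N / n_t), where N is the number of documents and n_t is the number of documents containing term t. The exact equation is left to the book. The cutoffs `min_docs` and `max_fraction` and the toy documents are assumptions chosen only to show rare and overly common terms being dropped.

```python
import math

def document_frequency(docs):
    """Map each term to the number of documents that contain it."""
    df = {}
    for tokens in docs:
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1
    return df

def idf(docs, min_docs=2, max_fraction=0.9):
    """IDF(t) = 1 + log(N / n_t), skipping terms that are too rare or too common.

    min_docs and max_fraction are illustrative thresholds, not values from the book.
    """
    n_docs = len(docs)
    weights = {}
    for term, n_t in document_frequency(docs).items():
        if n_t < min_docs or n_t / n_docs > max_fraction:
            continue
        weights[term] = 1 + math.log(n_docs / n_t)
    return weights

# Hypothetical tokenized documents.
docs = [
    ["jazz", "is", "music"],
    ["jazz", "sax", "is", "loud"],
    ["sax", "is", "brass"],
    ["the", "piano", "is", "music"],
]
weights = idf(docs)
print(sorted(weights))
```

Here "is" appears in every document and is dropped as too common, while words that appear in only one document are dropped as too rare.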
Combining them gives TFIDF, the value of a term (t) in a given document (d): TFIDF(t, d) = TF(t, d) × IDF(t)
each document becomes a feature vector that can then be used by algorithms
raw term counts are converted to TF values
see jazz musicians example
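A minimal sketch of how TFIDF turns each document into a feature vector is shown below. It assumes raw counts as TF and 1 + log(N / n_t) as IDF, which is only one of several variants in use. The `tfidf_vectors` helper and the three toy documents are hypothetical and unrelated to the book's jazz musicians data.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: TFIDF} dict per tokenized document."""
    n_docs = len(docs)
    # Number of documents containing each term.
    doc_freq = Counter(term for tokens in docs for term in set(tokens))
    vectors = []
    for tokens in docs:
        counts = Counter(tokens)  # TF: raw count of each term in this document
        vectors.append({
            term: count * (1 + math.log(n_docs / doc_freq[term]))
            for term, count in counts.items()
        })
    return vectors

# Hypothetical tokenized documents.
docs = [
    "jazz sax jazz".split(),
    "jazz piano".split(),
    "rock guitar".split(),
]
vectors = tfidf_vectors(docs)
for vec in vectors:
    print({term: round(value, 3) for term, value in vec.items()})
```

A term scores high when it is frequent in one document but rare across the collection, so "sax" outweighs a single use of "jazz" here.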
Other methods
N-gram sequences
used when word order is important
each feature is a sequence of up to n adjacent words, not just an individual word
useful when some phrases are significant and others aren't
disadvantage is that they increase feature sizes
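To illustrate the n-gram idea, here is a small sketch that builds features from single words plus adjacent word pairs. The helper names, the underscore joining convention, and the sample phrase are assumptions for the example.

```python
def ngrams(tokens, n):
    """Return every run of n adjacent tokens, joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_features(tokens, max_n=2):
    """Collect all n-grams for n = 1 .. max_n."""
    features = []
    for n in range(1, max_n + 1):
        features.extend(ngrams(tokens, n))
    return features

# Hypothetical sample phrase.
tokens = "the quick brown fox".split()
features = ngram_features(tokens, max_n=2)
print(features)
print(len(tokens), "words ->", len(features), "features")
```

Four words already produce seven features with pairs included, which shows how quickly the feature set grows.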
Named entity extraction
recognizes common phrases / entities
very knowledge intensive
need to be trained / coded very well to recognize these entities
Topic Models
a topic layer is added between the document and the model
the topic layer models the set of topics separately before relating them to documents
topics emerge from statistical regularities in data