Representing and Mining Text
Text data
has become increasingly common with social media use on the internet
why is text important?
it is unavoidable and everywhere
legacy applications mainly, where text isn't coded for the computer
understanding customer feedback
why is text difficult?
often referred to as unstructured data
linguistic structure, not structure built for computers
because context is important
text requires lots of pre-processing
representation
general strategy is to use the simplest technique that works
document = one piece of text, no matter how large or small
documents are composed of tokens and terms (think individual words for now)
group of documents is known as a corpus
"bag of words" technique
treat every document as a collection of individual words
binary evaluation - assigns a 1 if the document contains the keyword, and a 0 if it does not
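A minimal sketch of the binary bag-of-words idea, in Python. The toy corpus, the whitespace tokenizer, and the names `tokenize`, `vocabulary`, and `binary_vector` are assumptions made for illustration; real text would need more pre-processing than this.

```python
# Hypothetical toy corpus: each string is one document.
corpus = [
    "the service was great",
    "the food was cold",
    "great food great price",
]

def tokenize(document):
    # Naive tokenizer (assumption): lowercase, then split on whitespace.
    return document.lower().split()

# The vocabulary is every distinct term in the corpus, sorted for a stable column order.
vocabulary = sorted({term for doc in corpus for term in tokenize(doc)})

def binary_vector(document, vocabulary):
    # 1 if the document contains the term, 0 otherwise.
    # Word order and repeat counts are discarded.
    tokens = set(tokenize(document))
    return [1 if term in tokens else 0 for term in vocabulary]

binary_matrix = [binary_vector(doc, vocabulary) for doc in corpus]

print(vocabulary)
for row in binary_matrix:
    print(row)
```

Note that "great" appears twice in the third document but still scores only 1; the binary representation records presence, not how often a term occurs.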
term frequency technique