Representing and Mining Text

Text data

has become increasingly common with social media use on the internet

why is text important?

it is unavoidable and everywhere

mainly in legacy applications, where text isn't encoded for the computer

understanding customer feedback

why is text difficult?

often referred to as unstructured data

linguistic structure, not structure built for computers

because context is important

text requires lots of pre-processing

representation

general strategy is to use the simplest technique that works

document = one piece of text, no matter how large or small

documents are composed of tokens, also called terms (think individual words for now)

group of documents is known as a corpus

"bag of words" technique

treat every document as a collection of individual words

binary representation - assign a 1 if the document contains the term, and a 0 if it does not
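
a minimal sketch of the binary bag-of-words idea, assuming a very simple tokenizer (lowercase, split on anything that is not a letter, digit, or apostrophe). The function names and the two sample documents are hypothetical and only for illustration; real pre-processing would usually do more.

```python
import re

def tokenize(text):
    """Lowercase the text and pull out word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def binary_bag_of_words(corpus):
    """Return the sorted vocabulary and one 0/1 vector per document."""
    # Word order and repeats are discarded: each document becomes a set of tokens.
    token_sets = [set(tokenize(doc)) for doc in corpus]
    vocabulary = sorted(set().union(*token_sets))
    # 1 if the document contains the term, 0 if it does not.
    vectors = [
        [1 if term in tokens else 0 for term in vocabulary]
        for tokens in token_sets
    ]
    return vocabulary, vectors

# Hypothetical two-document corpus.
corpus = [
    "The service was great",
    "Great food, slow service",
]
vocabulary, vectors = binary_bag_of_words(corpus)
print(vocabulary)
for vector in vectors:
    print(vector)
```

each column corresponds to one term in the corpus vocabulary, so every document gets a vector of the same length no matter how long the document is.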

term frequency technique
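
the notes only name this technique, so the following is a sketch under an assumption: term frequency here means the raw count of how many times each term appears in a document, instead of just a 0 or 1. The function name and sample sentence are hypothetical.

```python
import re
from collections import Counter

def term_frequencies(document):
    """Count how many times each term occurs in one document."""
    # Same simple tokenizer assumption: lowercase, keep letters, digits, apostrophes.
    tokens = re.findall(r"[a-z0-9']+", document.lower())
    return Counter(tokens)

# Hypothetical document.
tf = term_frequencies("Good food, good prices, slow service")
print(tf.most_common())
```

a Counter returns 0 for terms that never occur, which lines up with the 0 entries in the binary representation.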
