Provost Ch 10: Representing and Mining Text
Important step of data mining process: data preparation
first engineer data to match existing tools
OR engineer new tools to match data
text requires pre-processing: conversion to a meaningful form
many types of files are intended as communication between people, not computers, so text is still everywhere
search engines apply data science to massive amounts of text
text is "unstructured" (linguistic) data
Representation
use the simplest (least expensive) representation that works for the text mining task
document = one piece of text, collection of documents = corpus, token/term = words
bag of words = treat a document as an unordered collection of words, with every word a potentially important feature
term frequency = count of a word's occurrences in a document, as a measure of its importance within that document
measuring sparseness = how rare a term is across the entire corpus; inverse document frequency (IDF) gives rarer terms higher weight
TFIDF = term frequency x inverse document frequency
IDF and entropy are similar
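A minimal sketch of the bag-of-words TFIDF weighting described above; the toy corpus and the particular IDF variant (1 + log) are illustrative assumptions, not from the chapter.

```python
import math
from collections import Counter

# Hypothetical toy corpus for illustration.
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the dog barks at the fox",
    "the corpus of the web is mostly raw text",
]
docs = [d.split() for d in corpus]   # naive whitespace tokenization into terms

def tfidf(term, doc_index):
    tf = Counter(docs[doc_index])[term]            # term frequency in this document
    n_containing = sum(term in d for d in docs)    # sparseness across the corpus
    idf = 1 + math.log(len(docs) / n_containing)   # rarer terms get a higher weight
    return tf * idf

print(tfidf("fox", 0))   # appears in 2 of 3 documents
print(tfidf("the", 0))   # appears in every document, so lower IDF
```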
N-gram sequences: include sequences of adjacent words as terms
adjacent pair = bi-gram, tri-gram = 3 adjacent words
greatly increases size of feature set
easy to generate
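A short sketch of how easy n-gram generation is; the sample sentence is made up for illustration.

```python
def ngrams(tokens, n):
    """Return all sequences of n adjacent tokens, joined into single terms."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox jumps".split()
print(ngrams(tokens, 2))   # bi-grams: adjacent word pairs
print(ngrams(tokens, 3))   # tri-grams: three adjacent words
```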
Named entity extraction = recognize common named entities in documents (New York Mets)
trained on a large corpus or hand-coded from knowledge of specific names (Oakland Raiders)
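A hedged sketch of the hand-coded approach to named entity extraction via a lookup of known names; the entity list and sample text are assumptions for illustration (trained extractors from NLP libraries would replace the lookup in practice).

```python
KNOWN_ENTITIES = ["New York Mets", "Oakland Raiders", "New York"]

def extract_entities(text, entities=KNOWN_ENTITIES):
    # Check longer names first so "New York Mets" is preferred over "New York".
    found = []
    for name in sorted(entities, key=len, reverse=True):
        if name in text and not any(name in f for f in found):
            found.append(name)
    return found

print(extract_entities("The New York Mets hosted the Oakland Raiders"))
```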
Topic Models = provide an extra layer between the terms (words or entities) in a document and the model
words or sequences being used are mapped to topics instead of directly to the final classifier
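A sketch of a topic model as that intermediate layer, assuming scikit-learn's LatentDirichletAllocation on a made-up corpus; the number of topics and other parameters are illustrative choices.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus: two rough themes (sports vs. gadgets).
corpus = [
    "the pitcher threw a perfect game for the home team",
    "the striker scored twice in the final match",
    "the new phone ships with a faster processor",
    "the laptop update improves battery life and performance",
]

counts = CountVectorizer().fit_transform(corpus)      # bag-of-words term counts
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_features = lda.fit_transform(counts)            # documents as topic mixtures

# topic_features (documents x topics) would feed the final classifier
# instead of the raw term counts.
print(topic_features.round(2))
```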