Chapter #10:
Representing & Mining Text
Fundamental Concept:
The importance of constructing mining-friendly data representations; representation of text for data mining.
Data are represented in ways natural to the problems from which they were derived.
Applying Mining Tools
We must either
Engineer the data representation to match the tools.
Build new tools to match the data.
It generally is simpler to first try to engineer the data to match existing tools, since they are well understood and numerous.
Examining text data allows us to illustrate many real complexities of data engineering, and also helps us to understand better a very important type of data.
In principle, text is just another form of data, and text processing is just a special case of representation engineering.
Importance of Text
Text is everywhere. Many legacy applications still produce or record text.
Exploiting this vast amount of data requires converting it to a meaningful form.
The Internet contains a vast amount of text across different web pages.
The thrust of Web 2.0 was about Internet sites allowing users to interact with one another as a community, and to generate much of a site's content. This user-generated content and interaction usually takes the form of text.
In business, understanding customer feedback often requires understanding text. Some important consumer attitudes are represented explicitly as data or can be inferred through behavior:
For example, via five-star ratings, click-through patterns, conversion rates, and so on.
Difficulty of Text
Text is referred to as unstructured data.
Unstructured:
Text does not have the sort of structure that we normally expect for data: tables of records with fields having fixed meanings (essentially, collections of feature vectors), as well as links between the tables.
Text is relatively dirty:
People write ungrammatically.
They misspell words.
They run words together, abbreviate unpredictably, and punctuate randomly.
Even when text is flawlessly expressed, it may contain synonyms and/or homographs.
Because text is intended for communication between people, context is important, much more so than with other forms of data.
Text must undergo a good amount of preprocessing before it can be used as input to a data mining algorithm.
The more complex the featurization, the more aspects of the text problem can be included.
Representation
The general strategy in text mining is to use the simplest (least expensive) technique that works.
Typically, all the text of a document is considered together and is retrieved as a single item when matched or categorized.
It is composed of individual tokens or terms.
Tokens/terms = words
Corpus = collection of documents
Information Retrieval
A document is one piece of text, no matter how large or small. A document could be a single sentence or a 100-page report, or anything in between, such as a YouTube comment or a blog posting.
Bag of Words
Text representation task
Taking a set of documents — each of which is a relatively free-form sequence of words — and turning it into our familiar feature-vector form.
Approach:
Bag of Words
Treats every document as just a collection of individual words. This approach ignores grammar, word order, sentence structure, and (usually) punctuation.
The representation is straightforward and inexpensive to generate, and tends to work well for many tasks.
A Set =>
allows only one instance of each item, whereas we want to take into account the number of occurrences of words.
A Bag =>
is a multi-set, where members are allowed to appear more than once.
Each word is a token, and each document is represented by a feature vector: a token's value is one if the token is present in the document, and zero if it is not.
This approach simply reduces a document to the set of words contained in it.
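As a concrete illustration, here is a minimal Python sketch of this binary representation; the two-document mini-corpus and the tokenize helper are invented purely for illustration:

```python
# Minimal sketch: binary (present/absent) bag-of-words representation.
# The two-document "corpus" below is invented purely for illustration.
docs = [
    "Jazz jazz jazz and more jazz",
    "The quick brown fox jumps over the lazy dog",
]

def tokenize(text):
    """Split a document into lowercase word tokens (whitespace only)."""
    return text.lower().split()

# Vocabulary = every distinct token seen anywhere in the corpus.
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

# Each document becomes a feature vector of 0/1 values:
# 1 if the token is present in the document, 0 otherwise.
binary_vectors = []
for doc in docs:
    present = set(tokenize(doc))
    binary_vectors.append([1 if term in present else 0 for term in vocabulary])

print(vocabulary)
for vec in binary_vectors:
    print(vec)
```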
Frequency
Frequency means word count.
Next Step:
To use the word count in the document instead of just a zero or one. This allows us to differentiate between how many times a word is used.
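One step up from the binary sketch above: a hedged Python example of the count (term frequency) representation, again on an invented mini-corpus, using collections.Counter so repeated words are tallied rather than just noted:

```python
from collections import Counter

# Invented mini-corpus for illustration only.
docs = [
    "jazz jazz jazz and more jazz",
    "the quick brown fox jumps over the lazy dog",
]

def tokenize(text):
    """Split a document into lowercase word tokens (whitespace only)."""
    return text.lower().split()

vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

# Count representation: each feature value is the number of times
# the term occurs in the document (its term frequency).
count_vectors = []
for doc in docs:
    counts = Counter(tokenize(doc))
    count_vectors.append([counts[term] for term in vocabulary])

print(vocabulary)
for vec in count_vectors:
    print(vec)  # "jazz" now gets a 4 in the first document, not just a 1
```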
Frequency Representation
The importance of a term in a document should increase with the number of times that term occurs.
Example (1):
Applying Count Representation
Example (2):
Some basic processing is performed on the words before putting them into the table. Consider the following text:
Microsoft Corp and Skype Global today announced that they have entered into a definitive agreement under which Microsoft will acquire Skype, the leading Internet communications company, for $8.5 billion in cash from the investor group led by Silver Lake. The agreement has been approved by the boards of directors of both Microsoft and Skype.
Steps (see the code sketch after this list):
First, the case has been normalized: every term is in lowercase.
Second, many words have been stemmed: their suffixes removed, so that verbs like announces, announced and announcing are all reduced to the term announc. Also, stemming transforms noun plurals to their singular forms.
Finally, stopwords have been removed. A stopword is a very common word in English:
and, of, are, the.
Numbers are commonly regarded as unimportant details for text processing, but the purpose of the representation should decide this.
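Here is one way these steps might be sketched in Python. It assumes the NLTK library is installed for its PorterStemmer; the small stopword set is an illustrative stand-in for a real stopword list:

```python
import re
from nltk.stem import PorterStemmer  # assumes the NLTK library is installed

text = ("Microsoft Corp and Skype Global today announced that they have "
        "entered into a definitive agreement under which Microsoft will "
        "acquire Skype, the leading Internet communications company, for "
        "$8.5 billion in cash from the investor group led by Silver Lake.")

# Tiny illustrative stopword set; a real system would use a much fuller list.
stopwords = {"and", "of", "are", "the", "a", "that", "they", "have", "into",
             "under", "which", "will", "for", "in", "from", "by", "today"}

stemmer = PorterStemmer()

# 1) normalize case, 2) keep only alphabetic tokens (numbers dropped),
# 3) remove stopwords, 4) stem each remaining term.
tokens = re.findall(r"[a-z]+", text.lower())
terms = [stemmer.stem(tok) for tok in tokens if tok not in stopwords]

print(terms)  # e.g. 'announced' -> 'announc', 'communications' -> 'commun'
```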
Measuring Sparseness: Inverse Document Frequency
Frequency =>
measures how prevalent a term is in a single document.
When deciding a term's weight, we also care how common it is in the entire corpus we're mining.
Opposing Considerations
(1) A term should not be too rare
.
For clustering, there is no point keeping a term that occurs only once: it will never be the basis of a meaningful cluster.
For this reason, text processing systems usually impose a small lower limit on the number of documents in which a term must occur.
(2) A term should not be too common.
It is not useful for classification.
It can’t serve as the basis for a cluster.
Overly common terms are typically eliminated. One way to do this is to impose an arbitrary upper limit on the number of documents in which a term may occur.
Many systems take into account the distribution of the term over a corpus as well. The fewer documents in which a term occurs, the more significant it is likely to be to the documents it does occur in.
Inverse Document Frequency (IDF)
IDF(t) = 1 + log( Total number of documents / Number of documents containing t )
IDF may be thought of as the boost a term gets for being rare.
When a term is very rare (the far left of the IDF curve), the IDF is quite high. It decreases quickly as t becomes more common in documents, and asymptotes at 1.0.
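A minimal sketch of this formula in Python (the document counts are invented), showing the boost a rare term gets and the asymptote at 1.0 for a term that appears everywhere; natural log is used here, and any base gives the same qualitative behavior:

```python
import math

def idf(total_docs, docs_containing_t):
    """IDF(t) = 1 + log(total documents / documents containing t)."""
    return 1 + math.log(total_docs / docs_containing_t)

# Invented corpus of 100 documents.
print(idf(100, 1))    # very rare term      -> about 5.6 (big boost)
print(idf(100, 50))   # fairly common term  -> about 1.7
print(idf(100, 100))  # occurs everywhere   -> exactly 1.0 (the asymptote)
```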
Combining Them: TFIDF
A very popular representation for text is the product of Term Frequency (TF) and Inverse Document Frequency (IDF), commonly referred to as TFIDF.
TFIDF(t, d) = TF(t, d) x IDF(t)
Systems employing the bag-of-words representation typically go through steps of stemming and stopword elimination before doing term counts.
Term counts within the documents form the TF values for each term, and the document counts across the corpus form the IDF values.
Each document thus becomes a feature vector, and the corpus is the set of these feature vectors.
This set can then be used in a data mining algorithm for classification, clustering, or retrieval.
Because there are very many potential terms with text representation, feature selection is often employed.
Systems do this in various ways, such as imposing minimum and maximum thresholds of term counts, and/or using a measure such as information gain.
The bag-of-words text representation approach treats every word in a document as an independent potential keyword (feature) of the document, then assigns values to each document based on frequency and rarity.
TFIDF is a very common value representation for terms, but it is not necessarily optimal.
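Putting the pieces together, here is a hedged sketch of the whole TFIDF computation on a tiny invented corpus that is assumed to be already tokenized, stemmed, and stopword-free; a real pipeline would add the preprocessing and feature selection described above:

```python
import math
from collections import Counter

# Invented mini-corpus; assume each document is already tokenized,
# lowercased, stemmed, and stopword-free.
docs = [
    ["microsoft", "acquir", "skype", "skype"],
    ["skype", "internet", "commun"],
    ["microsoft", "internet", "cloud"],
]

n_docs = len(docs)
vocabulary = sorted({t for doc in docs for t in doc})

# Document frequency: number of documents in which each term appears.
df = {t: sum(1 for doc in docs if t in doc) for t in vocabulary}

# IDF(t) = 1 + log(N / df(t))
idf = {t: 1 + math.log(n_docs / df[t]) for t in vocabulary}

# TFIDF(t, d) = TF(t, d) x IDF(t); each document becomes a feature vector.
tfidf_vectors = []
for doc in docs:
    tf = Counter(doc)
    tfidf_vectors.append([tf[t] * idf[t] for t in vocabulary])

print(vocabulary)
for vec in tfidf_vectors:
    print([round(v, 2) for v in vec])
```

In practice, libraries such as scikit-learn provide a TfidfVectorizer that performs this computation (with slightly different smoothing conventions), so the manual version is mainly useful for understanding.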
Relationship of IDF to Entropy
Inverse Document Frequency and entropy are somewhat similar: both seem to measure how "mixed" a set is with respect to a property.
Formula:
Consider a term t in a document set.
p(t) represents the probability that a document (chosen at random from the set) contains t:
p(t) = (Number of documents containing t) / (Total number of documents)
To simplify things, from here on we’ll refer to this estimate simply as p.
Formula:
IDF(t) = 1 + log( Total number of documents / Number of documents containing t ) = 1 + log(1/p)
The 1 is just a constant, so let's discard it. We then notice that IDF(t) is basically log(1/p). You may recall from algebra that log(1/p) is equal to -log(p).
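Spelling that out as a short, hedged derivation (using the p(t) defined above; the connection to entropy is that each entropy term, -p log p, contains the same -log p factor):

```latex
\[
  p(t) \;=\; \frac{\#\{\text{documents containing } t\}}{\#\{\text{documents}\}},
  \qquad
  \mathrm{IDF}(t) \;=\; 1 + \log\frac{1}{p(t)} \;=\; 1 - \log p(t).
\]
% Dropping the constant 1, the contribution of term t to entropy is
\[
  -\,p(t)\,\log p(t) \;=\; p(t)\,\bigl(\mathrm{IDF}(t) - 1\bigr).
\]
```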
Beyond Bag of Words
Basic Approach: Bag of Words
It requires no sophisticated parsing ability or other linguistic analysis. It performs surprisingly well on a variety of tasks, but sometimes it is not enough.
N-gram Sequences
In some cases, word order is important and you want to preserve some information about it in the representation. A next step up in complexity is to include sequences of adjacent words as terms.
Example
The sentence "The quick brown fox jumps." would be transformed into the set of its constituent words {quick, brown, fox, jumps}, plus the tokens quick_brown, brown_fox, and fox_jumps.
This general representation tactic is called n-grams. Adjacent pairs are commonly called bi-grams.
N-grams are useful when particular phrases are significant but their component words may not be.
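A minimal Python sketch of this transformation; treating only "the" as a stopword is an illustrative simplification:

```python
text = "The quick brown fox jumps."

# Treating only "the" as a stopword is an illustrative simplification.
stopwords = {"the"}

# Tokenize: lowercase, strip trailing punctuation, drop stopwords.
tokens = [w.strip(".,!?") for w in text.lower().split()]
tokens = [w for w in tokens if w and w not in stopwords]

# Keep the individual words plus adjacent pairs (bi-grams), joined with "_".
bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
features = tokens + bigrams

print(features)
# ['quick', 'brown', 'fox', 'jumps', 'quick_brown', 'brown_fox', 'fox_jumps']
```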
Advantages vs. Disadvantages
Advantage:
An advantage of using n-grams is that they are easy to generate; they require no linguistic knowledge or complex parsing algorithm.
Disadvantage:
A disadvantage of n-grams is that they greatly increase the size of the feature set.
The number of features generated can quickly get out of hand, and many of them will be very rare, occurring only once in the corpus.
Data mining using n-grams almost always needs some special consideration for dealing with massive numbers of features, such as a feature selection stage or special consideration to computational storage space.
Named Entity Extraction
Sometimes we want to apply more sophistication in phrase extraction. We may also want a preprocessing component that knows when word sequences constitute proper names.
Many text-processing toolkits include a named entity extractor of some sort. Usually these can process raw text and extract phrases annotated with terms like person or organization.
Unlike bag of words and n-grams, which are based on segmenting text on whitespace and punctuation, named entity extractors are knowledge intensive.
The quality of entity recognition can vary, and some extractors may have particular areas of expertise.
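For instance, a hedged sketch using the spaCy toolkit, assuming spaCy and its en_core_web_sm model are installed (other toolkits, such as NLTK or Stanford NER, expose similar functionality); the example sentence is adapted from the Skype announcement above:

```python
import spacy  # assumes spaCy and its en_core_web_sm model are installed

nlp = spacy.load("en_core_web_sm")

text = ("Microsoft Corp and Skype Global today announced that Microsoft "
        "will acquire Skype from the investor group led by Silver Lake.")

doc = nlp(text)
for ent in doc.ents:
    # Each entity is a phrase annotated with a type such as ORG or PERSON.
    print(ent.text, ent.label_)
```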
Topic Models
Learning such direct models is relatively efficient, but is not always optimal. Because of the complexity of language and documents, sometimes we want an additional layer between the document and the model.
Advantages
In a search engine, for example, a query can use terms that do not exactly match the specific words of a document; if they map to the correct topics, the document will still be considered relevant to the search.
Methods:
Matrix factorization methods, such as Latent Semantic Indexing, and probabilistic topic models, such as Latent Dirichlet Allocation.
In topic modeling, the terms associated with the topic, and any term weights, are learned by the topic modeling process.
As with clusters, the topics emerge from statistical regularities in the data.
They are not necessarily intelligible, and they are not guaranteed to correspond to topics familiar to people, though in many cases they are.
Latent Information Model
Think of latent information as a type of intermediate, unobserved layer of information inserted between the inputs and outputs.
The techniques are essentially the same for finding latent topics in text and for finding latent “taste” dimensions of movie viewers.
In the case of text, words map to topics and topics map to documents.
This makes the entire model more complex and more expensive to learn, but can yield better performance. In addition, the latent information is often interesting and useful in its own right.
The main idea of a topic layer is first to model the set of topics in a corpus separately.
As before, each document constitutes a sequence of words, but instead of the words being used directly by the final classifier, the words map to one or more topics.
The topics also are learned from the data.
The final classifier is defined in terms of these intermediate topics rather than words.
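As an illustration of such an intermediate topic layer, here is a hedged sketch using scikit-learn's CountVectorizer and LatentDirichletAllocation on an invented four-document corpus; real corpora are far larger and the learned topics correspondingly richer:

```python
# Assumes scikit-learn is installed; the four-document corpus is invented.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "microsoft acquires skype in internet communications deal",
    "skype offers internet calls and video communications",
    "stock prices fall as quarterly earnings disappoint investors",
    "investors react to earnings reports and rising stock prices",
]

# Words -> document-term counts (the bag-of-words step).
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Document-term counts -> two latent topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # each row: one document's topic mixture

# Inspect the highest-weighted terms per topic; these weights are what
# the topic modeling process learns from regularities in the data.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:4]
    print(f"topic {k}:", [terms[i] for i in top])

print(doc_topics.round(2))
```

A final classifier would then be trained on doc_topics (documents expressed over topics) rather than on the raw word counts.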
The Data
The data we’ll use comprise two separate time series:
(1)
the stream of news stories.
(2)
The corresponding stream of daily stock prices.
The Internet has many sources of financial data, such as Google Finance and Yahoo Finance.
Yahoo! aggregates news stories from a variety of sources such as Reuters, PR Web, and Forbes. Historical stock prices can be acquired from many sources, such as Google Finance.
Example:
News story from the corpus
WALTHAM, Mass.--(BUSINESS WIRE)--March 30, 1999--Summit Technology, Inc. (NASDAQ:BEAM) and Autonomous Technologies Corporation (NASDAQ:ATCI) announced today that the Joint Proxy/Prospectus for Summit's acquisition of Autonomous has been declared effective by the Securities and Exchange Commission. Copies of the document have been mailed to stockholders of both companies. "We are pleased that these proxy materials have been declared effective and look forward to the shareholder meetings scheduled for April 29," said Robert Palmisano, Summit's Chief Executive Officer.
As with many text sources, there is a lot of miscellaneous material since it is intended for human readers and not machine parsing.
The story includes the date and time, the news source, stock symbols and links (NASDAQ:BEAM), as well as background material not strictly germane to the news.
Each such story is tagged with the stock mentioned.
Sidebar: The News Is Messy
The financial news corpus is actually far messier than this one story implies, for several reasons.
(1)
Financial news comprises a wide variety of stories, including earnings announcements, analysts’ assessments, SEC filings, financial balance sheets, and so on.
(2)
Stories come in different formats, some with tabular data, some in a multi-paragraph “lead stories of the day” format, and so on.
(3)
Stock tagging is not perfect. It tends to be overly permissive, such that stories are included in the news feed of stocks that were not actually referenced in the story.
In short, the relevance of a stock to a document may not be clear without a careful reading. With deep parsing (or at least story segmentation) we could eliminate some of the noise, but with bag of words (or even named entity extraction) we cannot hope to remove all of it.