Chapter #10:
Representing & Mining Text
Fundamental Concept:
The importance of constructing mining-friendly data representations; representation of text for data mining.
Data are represented in ways natural to the problems from which they were derived.
Applying Mining Tools
We must either
Engineer the data representation to match the tools.
Build new tools to match the data.
It generally is simpler to first try to engineer the data to match existing tools, since they are well understood and numerous.
Examining text data allows us to illustrate many real complexities of data engineering, and also helps us to understand better a very important type of data.
In principle, text is just another form of data, and text processing is just a special case of representation engineering.
Importance of Text
Text is everywhere. Many legacy applications still produce or record text.
Exploiting this vast amount of data requires converting it to a meaningful form.
The Internet contains a vast amount of text across different web pages.
The thrust of Web 2.0 was about Internet sites allowing users to interact with one another as a community, and to generate much of a site's content. This user-generated content and interaction usually takes the form of text.
In business, understanding customer feedback often requires understanding text. Some important consumer attitudes are represented explicitly as data or can be inferred through behavior:
For example, via five-star ratings, click-through patterns, conversion rates, and so on.
Difficulty of Text
Text is referred to as unstructured data.
Unstructured:
Text does not have the sort of structure that we normally expect for data: tables of records with fields having fixed meanings (essentially, collections of feature vectors), as well as links between the tables.
Text is relatively dirty:
People write ungrammatically.
They misspell words.
They run words together, abbreviate unpredictably, and punctuate randomly.
Even when text is flawlessly expressed, it may contain synonyms and/or homographs.
Because text is intended for communication between people, context is important, much more so than with other forms of data.
Text must undergo a good amount of preprocessing before it can be used as input to a data mining algorithm.
The more complex the featurization, the more aspects of the text problem can be included.
Representation
The general strategy in text mining is to use the simplest (least expensive) technique that works.
Typically, all the text of a document is considered together and is retrieved as a single item when matched or categorized.
It is composed of individual tokens or terms.
Tokens/terms = words
Corpus = collection of documents
Information Retrieval
A document is one piece of text, no matter how large or small. A document could be a single sentence or a 100-page report, or anything in between, such as a YouTube comment or a blog posting.
Bag of Words
Text representation task
Taking a set of documents — each of which is a relatively free-form sequence of words — and turning it into our familiar feature-vector form.
Approach:
Bag of Words
Treats every document as just a collection of individual words. This approach ignores grammar, word order, sentence structure, and (usually) punctuation.
The representation is straightforward and inexpensive to generate, and tends to work well for many tasks.
A Set =>
allows only one instance of each item, whereas we want to take into account the number of occurrences of words.
A Bag =>
is a multi-set, where members are allowed to appear more than once.
Each word is a token, and each document is represented by a feature vector: a token's value is one if the token is present in the document, and zero if it is not.
This approach simply reduces a document to the set of words contained in it.
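As a concrete illustration, here is a minimal Python sketch of this binary representation; the two-document mini-corpus and the tokenize helper are invented purely for illustration:

```python
# Minimal sketch: binary (present/absent) bag-of-words representation.
# The two-document "corpus" below is invented purely for illustration.
docs = [
    "Jazz jazz jazz and more jazz",
    "The quick brown fox jumps over the lazy dog",
]

def tokenize(text):
    """Split a document into lowercase word tokens (whitespace only)."""
    return text.lower().split()

# Vocabulary = every distinct token seen anywhere in the corpus.
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

# Each document becomes a feature vector of 0/1 values:
# 1 if the token is present in the document, 0 otherwise.
binary_vectors = []
for doc in docs:
    present = set(tokenize(doc))
    binary_vectors.append([1 if term in present else 0 for term in vocabulary])

print(vocabulary)
for vec in binary_vectors:
    print(vec)
```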
Frequency
Frequency means word count.
Next Step:
To use the word count in the document instead of just a zero or one. This allows us to differentiate between how many times a word is used.
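One step up from the binary sketch above: a hedged Python example of the count (term frequency) representation, again on an invented mini-corpus, using collections.Counter so repeated words are tallied rather than just noted:

```python
from collections import Counter

# Invented mini-corpus for illustration only.
docs = [
    "jazz jazz jazz and more jazz",
    "the quick brown fox jumps over the lazy dog",
]

def tokenize(text):
    """Split a document into lowercase word tokens (whitespace only)."""
    return text.lower().split()

vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

# Count representation: each feature value is the number of times
# the term occurs in the document (its term frequency).
count_vectors = []
for doc in docs:
    counts = Counter(tokenize(doc))
    count_vectors.append([counts[term] for term in vocabulary])

print(vocabulary)
for vec in count_vectors:
    print(vec)  # "jazz" now gets a 4 in the first document, not just a 1
```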
Frequency Representation
The importance of a term in a document should increase with the number of times that term occurs.
Example (1):
Applying Count Representation
Example (2):
Some basic processing is performed on the words before putting them into the table. Consider the following text:
Microsoft Corp and Skype Global today announced that they have entered into a definitive agreement under which Microsoft will acquire Skype, the leading Internet communications company, for $8.5 billion in cash from the investor group led by Silver Lake. The agreement has been approved by the boards of directors of both Microsoft and Skype.
Steps (see the code sketch after this list):
First, the case has been normalized: every term is in lowercase.
Second, many words have been stemmed: their suffixes removed, so that verbs like announces, announced and announcing are all reduced to the term announc. Also, stemming transforms noun plurals to their singular forms.
Finally, stopwords have been removed. A stopword is a very common word in English:
and, of, are, the.
Numbers are commonly regarded as unimportant details for text processing, but the purpose of the representation should decide this.
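Here is one way these steps might be sketched in Python. It assumes the NLTK library is installed for its PorterStemmer; the small stopword set is an illustrative stand-in for a real stopword list:

```python
import re
from nltk.stem import PorterStemmer  # assumes the NLTK library is installed

text = ("Microsoft Corp and Skype Global today announced that they have "
        "entered into a definitive agreement under which Microsoft will "
        "acquire Skype, the leading Internet communications company, for "
        "$8.5 billion in cash from the investor group led by Silver Lake.")

# Tiny illustrative stopword set; a real system would use a much fuller list.
stopwords = {"and", "of", "are", "the", "a", "that", "they", "have", "into",
             "under", "which", "will", "for", "in", "from", "by", "today"}

stemmer = PorterStemmer()

# 1) normalize case, 2) keep only alphabetic tokens (numbers dropped),
# 3) remove stopwords, 4) stem each remaining term.
tokens = re.findall(r"[a-z]+", text.lower())
terms = [stemmer.stem(tok) for tok in tokens if tok not in stopwords]

print(terms)  # e.g. 'announced' -> 'announc', 'communications' -> 'commun'
```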
Measuring Sparseness: Inverse Document Frequency
Frequency =>
measures how prevalent a term is in a single document.
When deciding a term's weight, we also care how common it is in the entire corpus we're mining.
Opposing Considerations
(1) A term should not be too rare
.
For clustering, there is no point keeping a term that occurs only once: it will never be the basis of a meaningful cluster.
For this reason, text processing systems usually impose a small lower limit on the number of documents in which a term must occur.
(2) A term should not be too common.
It is not useful for classification.
It can’t serve as the basis for a cluster.
Overly common terms are typically eliminated. One way to do this is to impose an arbitrary upper limit on the number of documents in which a term may occur.
Many systems take into account the distribution of the term over a corpus as well. The fewer documents in which a term occurs, the more significant it is likely to be to the documents it does occur in.
Inverse Document Frequency (IDF)
IDF(t) = 1 + log( Total number of documents / Number of documents containing t )
IDF may be thought of as the boost a term gets for being rare.
When a term is very rare (the far left of the IDF curve), the IDF is quite high. It decreases quickly as t becomes more common in documents, and asymptotes at 1.0.
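A minimal sketch of this formula in Python (the document counts are invented), showing the boost a rare term gets and the asymptote at 1.0 for a term that appears everywhere; natural log is used here, and any base gives the same qualitative behavior:

```python
import math

def idf(total_docs, docs_containing_t):
    """IDF(t) = 1 + log(total documents / documents containing t)."""
    return 1 + math.log(total_docs / docs_containing_t)

# Invented corpus of 100 documents.
print(idf(100, 1))    # very rare term      -> about 5.6 (big boost)
print(idf(100, 50))   # fairly common term  -> about 1.7
print(idf(100, 100))  # occurs everywhere   -> exactly 1.0 (the asymptote)
```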
Combining Them: TFIDF
A very popular representation for text is the product of Term Frequency (TF) and Inverse Document Frequency (IDF), commonly referred to as TFIDF.
TFIDF(t, d) = TF(t, d) x IDF(t)
Systems employing the bag-of-words representation typically go through steps of stemming and stopword elimination before doing term counts.
Term counts within the documents form the TF values for each term, and the document counts across the corpus form the IDF values.
Each document thus becomes a feature vector, and the corpus is the set of these feature vectors.
This set can then be used in a data mining algorithm for classification, clustering, or retrieval.
Because there are very many potential terms with text representation, feature selection is often employed.
Systems do this in various ways, such as imposing minimum and maximum thresholds of term counts, and/or using a measure such as information gain.
The bag-of-words text representation approach treats every word in a document as an independent potential keyword (feature) of the document, then assigns values to each document based on frequency and rarity.
TFIDF is a very common value representation for terms, but it is not necessarily optimal.
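Putting the pieces together, here is a hedged sketch of the whole TFIDF computation on a tiny invented corpus that is assumed to be already tokenized, stemmed, and stopword-free; a real pipeline would add the preprocessing and feature selection described above:

```python
import math
from collections import Counter

# Invented mini-corpus; assume each document is already tokenized,
# lowercased, stemmed, and stopword-free.
docs = [
    ["microsoft", "acquir", "skype", "skype"],
    ["skype", "internet", "commun"],
    ["microsoft", "internet", "cloud"],
]

n_docs = len(docs)
vocabulary = sorted({t for doc in docs for t in doc})

# Document frequency: number of documents in which each term appears.
df = {t: sum(1 for doc in docs if t in doc) for t in vocabulary}

# IDF(t) = 1 + log(N / df(t))
idf = {t: 1 + math.log(n_docs / df[t]) for t in vocabulary}

# TFIDF(t, d) = TF(t, d) x IDF(t); each document becomes a feature vector.
tfidf_vectors = []
for doc in docs:
    tf = Counter(doc)
    tfidf_vectors.append([tf[t] * idf[t] for t in vocabulary])

print(vocabulary)
for vec in tfidf_vectors:
    print([round(v, 2) for v in vec])
```

In practice, libraries such as scikit-learn provide a TfidfVectorizer that performs this computation (with slightly different smoothing conventions), so the manual version is mainly useful for understanding.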
Relationship of IDF to Entropy
Inverse Document Frequency and entropy are somewhat similar: both seem to measure how "mixed" a set is with respect to a property.
Formula:
Consider a term t in a document set.
p(t) represents the probability that a document (chosen at random from the set) contains t:
p(t) = (Number of documents containing t) / (Total number of documents)
To simplify things, from here on we’ll refer to this estimate simply as p.
Formula:
IDF(t) = 1 + log( Total number of documents / Number of documents containing t ) = 1 + log(1/p)
The 1 is just a constant, so let's discard it. We then notice that IDF(t) is basically log(1/p). You may recall from algebra that log(1/p) is equal to -log(p).
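Spelling that out as a short, hedged derivation (using the p(t) defined above; the connection to entropy is that each entropy term, -p log p, contains the same -log p factor):

```latex
\[
  p(t) \;=\; \frac{\#\{\text{documents containing } t\}}{\#\{\text{documents}\}},
  \qquad
  \mathrm{IDF}(t) \;=\; 1 + \log\frac{1}{p(t)} \;=\; 1 - \log p(t).
\]
% Dropping the constant 1, the contribution of term t to entropy is
\[
  -\,p(t)\,\log p(t) \;=\; p(t)\,\bigl(\mathrm{IDF}(t) - 1\bigr).
\]
```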
Beyond Bag of Words
Basic Approach: Bag of Words
It requires no sophisticated parsing ability or other linguistic analysis. It performs surprisingly well on a variety of tasks, but sometimes it is not enough.
N-gram Sequences
In some cases, word order is important and you want to preserve some information about it in the representation. A next step up in complexity is to include sequences of adjacent words as terms.
Example
The sentence "The quick brown fox jumps." would be transformed into the set of its constituent words {quick, brown, fox, jumps}, plus the tokens quick_brown, brown_fox, and fox_jumps.
This general representation tactic is called n-grams. Adjacent pairs are commonly called bi-grams.
N-grams are useful when particular phrases are significant but their component words may not be.
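A minimal Python sketch of this transformation; treating only "the" as a stopword is an illustrative simplification:

```python
text = "The quick brown fox jumps."

# Treating only "the" as a stopword is an illustrative simplification.
stopwords = {"the"}

# Tokenize: lowercase, strip trailing punctuation, drop stopwords.
tokens = [w.strip(".,!?") for w in text.lower().split()]
tokens = [w for w in tokens if w and w not in stopwords]

# Keep the individual words plus adjacent pairs (bi-grams), joined with "_".
bigrams = [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
features = tokens + bigrams

print(features)
# ['quick', 'brown', 'fox', 'jumps', 'quick_brown', 'brown_fox', 'fox_jumps']
```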
Advantages vs. Disadvantages
Advantage:
An advantage of using n-grams is that they are easy to generate; they require no linguistic knowledge or complex parsing algorithm.
Disadvantage:
A disadvantage of n-grams is that they greatly increase the size of the feature set.
The number of features generated can quickly get out of hand, and many of them will be very rare, occurring only once in the corpus.
Data mining using n-grams almost always needs some special consideration for dealing with massive numbers of features, such as a feature selection stage or special consideration to computational storage space.
Named Entity Extraction
Sometimes we want to apply more sophistication in phrase extraction. We may also want a preprocessing component that knows when word sequences constitute proper names.
Many text-processing toolkits include a named entity extractor of some sort. Usually these can process raw text and extract phrases annotated with terms like person or organization.
Unlike bag of words and n-grams, which are based on segmenting text on whitespace and punctuation, named entity extractors are knowledge intensive.
The quality of entity recognition can vary, and some extractors may have particular areas of expertise.
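For instance, a hedged sketch using the spaCy toolkit, assuming spaCy and its en_core_web_sm model are installed (other toolkits, such as NLTK or Stanford NER, expose similar functionality); the example sentence is adapted from the Skype announcement above:

```python
import spacy  # assumes spaCy and its en_core_web_sm model are installed

nlp = spacy.load("en_core_web_sm")

text = ("Microsoft Corp and Skype Global today announced that Microsoft "
        "will acquire Skype from the investor group led by Silver Lake.")

doc = nlp(text)
for ent in doc.ents:
    # Each entity is a phrase annotated with a type such as ORG or PERSON.
    print(ent.text, ent.label_)
```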
Topic Models
Learning such direct models is relatively efficient, but is not always optimal. Because of the complexity of language and documents, sometimes we want an additional layer between the document and the model.
Advantages
In a search engine, for example, a query can use terms that do not exactly match the specific words of a document; if they map to the correct topics, the document will still be considered relevant to the search.
Methods:
Matrix factorization methods, such as Latent Semantic Indexing, and probabilistic topic models, such as Latent Dirichlet Allocation.
In topic modeling, the terms associated with the topic, and any term weights, are learned by the topic modeling process.
As with clusters, the topics emerge from statistical regularities in the data.
They are not necessarily intelligible, and they are not guaranteed to correspond to topics familiar to people, though in many cases they are.
Latent Information Model
Think of latent information as a type of intermediate, unobserved layer of information inserted between the inputs and outputs.
The techniques are essentially the same for finding latent topics in text and for finding latent “taste” dimensions of movie viewers.
In the case of text, words map to topics and topics map to documents.
This makes the entire model more complex and more expensive to learn, but can yield better performance. In addition, the latent information is often interesting and useful in its own right.
The main idea of a topic layer is first to model the set of topics in a corpus separately.
As before, each document constitutes a sequence of words, but instead of the words being used directly by the final classifier, the words map to one or more topics.
The topics also are learned from the data.
The final classifier is defined in terms of these intermediate topics rather than words.
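As an illustration of such an intermediate topic layer, here is a hedged sketch using scikit-learn's CountVectorizer and LatentDirichletAllocation on an invented four-document corpus; real corpora are far larger and the learned topics correspondingly richer:

```python
# Assumes scikit-learn is installed; the four-document corpus is invented.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "microsoft acquires skype in internet communications deal",
    "skype offers internet calls and video communications",
    "stock prices fall as quarterly earnings disappoint investors",
    "investors react to earnings reports and rising stock prices",
]

# Words -> document-term counts (the bag-of-words step).
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Document-term counts -> two latent topics.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(X)  # each row: one document's topic mixture

# Inspect the highest-weighted terms per topic; these weights are what
# the topic modeling process learns from regularities in the data.
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:4]
    print(f"topic {k}:", [terms[i] for i in top])

print(doc_topics.round(2))
```

A final classifier would then be trained on doc_topics (documents expressed over topics) rather than on the raw word counts.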
The Data
The data we’ll use comprise two separate time series:
(1)
the stream of news stories.
(2)
The corresponding stream of daily stock prices.
The Internet has many sources of financial data, such as Google Finance and Yahoo Finance.
Yahoo! aggregates news stories from a variety of sources such as Reuters, PR Web, and Forbes. Historical stock prices can be acquired from many sources, such as Google Finance.
Example:
News story from the corpus
WALTHAM, Mass.--(BUSINESS WIRE)--March 30, 1999--Summit Technology, Inc. (NASDAQ:BEAM) and Autonomous Technologies Corporation (NASDAQ:ATCI) announced today that the Joint Proxy/Prospectus for Summit's acquisition of Autonomous has been declared effective by the Securities and Exchange Commission. Copies of the document have been mailed to stockholders of both companies. "We are pleased that these proxy materials have been declared effective and look forward to the shareholder meetings scheduled for April 29," said Robert Palmisano, Summit's Chief Executive Officer.
As with many text sources, there is a lot of miscellaneous material since it is intended for human readers and not machine parsing.
The story includes the date and time, the news source, stock symbols and links (NASDAQ:BEAM), as well as background material not strictly germane to the news.
Each such story is tagged with the stock mentioned.
Sidebar: The News Is Messy
The financial news corpus is actually far messier than this one story implies, for several reasons.
(1)
Financial news comprises a wide variety of stories, including earnings announcements, analysts’ assessments, SEC filings, financial balance sheets, and so on.
(2)
Stories come in different formats, some with tabular data, some in a multi-paragraph “lead stories of the day” format, and so on.
(3)
Stock tagging is not perfect. It tends to be overly permissive, such that stories are included in the news feed of stocks that were not actually referenced in the story.
In short, the relevance of a stock to a document may not be clear without a careful reading. With deep parsing (or at least story segmentation) we could eliminate some of the noise, but with bag of words (or even named entity extraction) we cannot hope to remove all of it.