BA Thesis: Topic Modeling - Coggle Diagram

- - - - achieves significant compression in large collections
      - derived features of this method can capture some aspects of basic linguistic notions such as synonymy and polysemy
      - "uses a singular value decomposition of the X matrix to identify a linear subspace in the space of tf-idf features that captures most of the variance in the collection." 1 - p. 2
      - based on "bags-of-words"-assumption
      - "first technique that can produce document representation in the form of a collective of words [...] by attaching the Bag-of-Words feature as a document representation" 2 - p. 1
    - - "models each word in a document as a sample from a mixture model, where the mixture components are multinomial random variables that can be viewed as representations of "topics"" 1 - p.2
      - alternative to LSI
      - each word is generated from a single topic, and different words in a document may be generated from different topics
      - each document is represented as a list of mixing proportions for these mixture components and thereby reduced to a probability distribution on a fixed set of topics
      - based on "bags-of-words"-assumption
      - developed by Hofmann in 1999
      - "uses probabilistic values as a determinant of the topic weight of each existing document" 2 - p.1
  - - - "documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words" p. 4
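
The LSI notes above (tf-idf features, then a singular value decomposition that keeps a low-dimensional subspace) can be sketched with NumPy as follows. This is a minimal illustration, not the procedure of either cited source: the vocabulary, the counts, the idf variant `log(N / df)` and the choice `k = 2` are all assumptions made up for the example.

```python
import numpy as np

# Toy term-document count matrix X (rows = terms, columns = documents).
# Vocabulary and counts are invented for illustration only.
terms = ["car", "automobile", "engine", "flower", "petal"]
X = np.array([
    [2, 0, 1, 0],   # car
    [0, 2, 1, 0],   # automobile
    [1, 1, 2, 0],   # engine
    [0, 0, 0, 2],   # flower
    [0, 0, 0, 1],   # petal
], dtype=float)

# tf-idf weighting (one simple variant: raw term frequency * log(N / df)).
n_docs = X.shape[1]
df = (X > 0).sum(axis=1)            # number of documents containing each term
idf = np.log(n_docs / df)
X_tfidf = X * idf[:, None]

# Truncated SVD: keep only the k largest singular values.
k = 2
U, s, Vt = np.linalg.svd(X_tfidf, full_matrices=False)
term_vectors = U[:, :k] * s[:k]              # shape (n_terms, k)
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T    # shape (n_docs, k)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print("car vs automobile, tf-idf space:", cosine(X_tfidf[0], X_tfidf[1]))
print("car vs automobile, latent space:", cosine(term_vectors[0], term_vectors[1]))
print("car vs flower, latent space:    ", cosine(term_vectors[0], term_vectors[3]))
```

In this toy matrix "car" and "automobile" share only one of the four documents, so their tf-idf rows are far apart, yet their latent vectors point in the same direction because both occur alongside "engine". That is the sense in which the derived features can capture some aspects of synonymy.
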
- - - - within a document: which topics are present in this document, and in what proportions?
      - we have a topic distribution for each document
    - - then works like reverse engineering: it tries to recover the topic distributions together with the words that belong to each topic
        
        which word came from which topic?
        
        which distribution did each document have?
      - words are drawn from each topic in proportion to their probability in that topic; the most probable word of the topic is not always the one written down, but more probable words are written down more often
      - LDA assumes that the words in each document come from the topics in proportion to that document's topic distribution
    - - given that I have a certain topic, what is the probability of seeing this specific word?
      - top words in topics are called 'topic descriptors'
      - for each topic: how often a word occurs in it, i.e. with what probability the word appears there
    - - beta parameter:
        
        low: each word tends to belong to one particular topic rather than several -> very specialized topics
        
        high: every topic spreads its probability over most of the vocabulary, so words are shared across topics
        
        how specific are words to topics? (words can appear in many topics); controls the word distribution of each topic
      - alpha parameter:
        
        high alpha: I will see a lot of topics in one document
        
        low alpha: one or two topics that dominate document
        
        how many topics per document? controls topic distribution
      - these parameters only encourage the model to search in that direction; their effect is not deterministic
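
The generative story and the role of alpha and beta described above can be sketched as a small simulation. This is a hedged toy example, assuming symmetric Dirichlet priors and NumPy's random generator; the function names `generate_corpus` and `mean_max_share`, the corpus sizes and the concrete alpha/beta values are hypothetical choices for illustration, not part of the notes or the cited papers.

```python
import numpy as np

def generate_corpus(n_docs, doc_len, n_topics, vocab_size, alpha, beta, seed=0):
    """Sample a toy corpus from the LDA generative process (symmetric priors)."""
    rng = np.random.default_rng(seed)
    # One word distribution per topic, drawn from Dirichlet(beta).
    topic_word = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    # One topic distribution per document, drawn from Dirichlet(alpha).
    doc_topic = rng.dirichlet(np.full(n_topics, alpha), size=n_docs)
    docs = []
    for theta in doc_topic:
        # Each word position first picks a topic, then a word from that topic.
        z = rng.choice(n_topics, size=doc_len, p=theta)
        words = [int(rng.choice(vocab_size, p=topic_word[t])) for t in z]
        docs.append(words)
    return doc_topic, topic_word, docs

def mean_max_share(dist):
    """Average weight of the largest entry per row (close to 1 = concentrated)."""
    return float(dist.max(axis=1).mean())

# Low alpha/beta versus high alpha/beta, everything else held fixed.
low_theta, low_phi, low_docs = generate_corpus(100, 20, 5, 30, alpha=0.1, beta=0.1)
high_theta, high_phi, high_docs = generate_corpus(100, 20, 5, 30, alpha=10.0, beta=10.0)

print("dominant topic share per document, low alpha: ", mean_max_share(low_theta))
print("dominant topic share per document, high alpha:", mean_max_share(high_theta))
print("dominant word share per topic, low beta:      ", mean_max_share(low_phi))
print("dominant word share per topic, high beta:     ", mean_max_share(high_phi))
```

With low alpha, one topic typically takes most of each document's probability mass, while high alpha gives mixtures close to uniform; the same pattern holds for beta and the words within a topic. Fitting LDA runs this process in reverse: given only `docs`, it tries to infer `doc_topic` and `topic_word`.
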