4 [IR] Boolean and Vector Space Model
Exact Match model: Boolean model
Why Boolean Retrieval Works
Boolean operators approximate natural language
AND can discover relationships between concepts
OR can discover alternate terminology
AND NOT can remove alternate meanings
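A minimal sketch of how these operators map onto an inverted index, with postings stored as sets of document IDs (the tiny index below is invented for illustration):
```python
# Boolean operators as set operations on postings (hypothetical index).
index = {
    "party":      {1, 2, 4, 7},   # doc IDs containing each term
    "democratic": {2, 4, 9},
    "lawsuit":    {1, 7},
}

democratic_and_party = index["democratic"] & index["party"]  # AND: intersection -> {2, 4}
party_or_lawsuit     = index["party"] | index["lawsuit"]     # OR: union -> {1, 2, 4, 7}
party_not_lawsuit    = index["party"] - index["lawsuit"]     # AND NOT: difference -> {2, 4}
```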
Why Boolean Retrieval Fails
Natural language is way more complex
AND "discovers" non-existent relationships
terms in different sentences, paragraphs
especially hard in full text search
guessing terminology for OR is hard
good, nice, excellent, outstanding, awesome
guessing terms to exclude is even harder
Democratic party, party to a lawsuit
Strengths of Boolean Match Model
Strengths
Precise, if you know the right strategies
Precise, if you have an idea of what you're looking for
Efficient for the computer
Weaknesses
Limited model capacity
Boolean logic insufficient to capture the richness of language
No control over size of result set: either too many documents or none
What about partial matches? Documents that don't quite match the query may be useful also
Hard for users
Users must learn Boolean logic
Users must be familiar with the collection
When do you stop reading? All documents in the result set are considered "equally good"
Best Match model: Vector Space Model
Characteristics
Query describes a good or "best" matching document
Matching is calculated via relevance, similarity, or probability
All documents have some degree of matching to the query
Usually returns documents in ranked order of this degree
So no longer a set, but a long ranked list
But can be ranked by other criteria, such as author, date
aka "ranked retrieval"
Order documents by how likely they are to be relevant to the information need
Why Ranked Retrieval
Arranging documents by relevance is
Closer to how humans think
Closer to user behavior
Solves the feast-or-famine problem in Boolean retrieval
Best (partial) match: documents need not have all query terms
Although documents with more query terms should be "better"
With a ranked list of documents it does not matter how large the retrieval set is
Scoring
Term frequency tf
The term frequency tf(t,d) of term t in document d is defined as the number of times that t occurs in d
Relevance does not increase proportionally with term frequency
Log-frequency weighting
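A common instantiation is w(t,d) = 1 + log10(tf(t,d)) for tf > 0, else 0; a minimal sketch:
```python
import math

def log_tf_weight(tf: int) -> float:
    # Dampen raw counts: relevance should grow sublinearly with tf.
    return 1 + math.log10(tf) if tf > 0 else 0.0

# log_tf_weight(1) == 1.0, log_tf_weight(10) == 2.0, log_tf_weight(1000) == 4.0
```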
Document frequency: df
Rare terms are more informative than frequent terms
We need to define idf (inverse document frequency)
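The standard definition is idf(t) = log10(N / df(t)), where N is the total number of documents in the collection; a minimal sketch:
```python
import math

def idf(df_t: int, N: int) -> float:
    # Rare terms (small df) get large weights; a term in every doc gets 0.
    return math.log10(N / df_t)

# With N = 1_000_000: idf(1, N) == 6.0, idf(1000, N) == 3.0, idf(N, N) == 0.0
```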
tf-idf weighting
Increases with the rarity of the term in the collection
Increases with the number of occurrences within a document
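Combining the two pieces (reusing the log_tf_weight and idf helpers sketched above):
```python
def tf_idf(tf: int, df_t: int, N: int) -> float:
    # Grows with occurrences in the document (log-tf) and with
    # the rarity of the term in the collection (idf).
    return log_tf_weight(tf) * idf(df_t, N)
```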
Vector Space Model
Ranks documents by their similarity to the query
aka similarity-based model
Both documents and queries are represented as vectors
Using "inner product" to calculate similarity
Compute tf-idf weights to form the vector space
Normalize the vectors
Calculate the similarity scores and retrieve the results
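A minimal sketch of the scoring step, with documents and queries as sparse dicts of tf-idf weights (names hypothetical); dividing the inner product by both vector lengths gives the cosine:
```python
import math

def cosine(u: dict, v: dict) -> float:
    # Inner product over shared terms, normalized by both lengths.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Rank docs by descending similarity to the query (doc_vectors hypothetical):
# ranked = sorted(doc_vectors, key=lambda d: cosine(query_vec, doc_vectors[d]), reverse=True)
```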
Summary
Basic
Each dimension is a term in the vocabulary
Vector elements are real values, reflecting the importance of the terms
any vector (docs, queries...) can be compared to any other
Cosine correlation is the most often used similarity metric
Advantages
simple to implement
effective for comparing anything to anything (docs, queries, ...)
Limitations
assume independence among terms
does not explicitly model relevance
The query is just an approximation of the user's information need
Assumes that queries and documents are the same kind of object (the query is treated as a short document)
Efficiency in Vector Space Model
Efficient Cosine Ranking - I
Find the K docs in the collection "nearest" to the query > the K largest query-doc cosines
Special Case: unweighted queries
Opportunity: no weighting on query terms
Solution: assume each query term occurs only once
Then, for ranking, we don't need to normalize the query vector
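A sketch of why this shortcut is safe (helper names hypothetical): with every query weight fixed at 1, the query norm is the same constant for all documents, so omitting it cannot change the ranking:
```python
def score_unweighted(query_terms, doc_weights, doc_norm):
    # doc_weights: term -> tf-idf weight for one document;
    # doc_norm: the document vector's Euclidean length.
    # The query-side norm is constant across docs, so it is omitted.
    return sum(doc_weights.get(t, 0.0) for t in set(query_terms)) / doc_norm
```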
Efficient Cosine Ranking - II
Computing a single cosine efficiently
choosing the K largest cosine values efficiently
Key question: Can we do this without computing all N cosines?
Choosing the K largest cosines: selection vs sorting
Typically we want to retrieve the top K docs (in the cosine ranking for the query)
idea 1: Let J = number of docs with nonzero cosines > we seek the K best of these J
Use heap for selecting top K
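For instance, Python's heapq selects the K largest of J nonzero scores in roughly O(J log K) time instead of the O(J log J) of a full sort (the scores dict is invented for illustration):
```python
import heapq

scores = {"d1": 0.90, "d2": 0.30, "d3": 0.75, "d4": 0.60}  # hypothetical cosines
K = 2
top_k = heapq.nlargest(K, scores.items(), key=lambda kv: kv[1])
# [('d1', 0.9), ('d3', 0.75)]
```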
Pruning postings
idea 1: index elimination: only consider high-idf query terms
For a query such as "novel the catcher in the rye" > only accumulate scores from catcher and rye
Benefit: postings of low-idf terms have many docs > these (many) docs get eliminated from the set A of score accumulators
idea 2: index elimination: only consider docs containing many query terms
Any doc with at least one query term is a candidate for the top K output list
For multi-term queries, only compute scores for docs containing several of the query terms (see the sketch after this list)
easy to implement in postings traversal
idea 3: champion lists: precompute for each term the r docs with the highest weights, and at query time score only docs in the union of the query terms' champion lists (sketched below)
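A minimal sketch of ideas 2 and 3, assuming simple in-memory structures (postings: term -> set of doc IDs; weights_by_term: term -> {doc_id: tf-idf weight}), both hypothetical:
```python
from collections import Counter

def candidate_docs(postings, query_terms, min_match=2):
    # Idea 2: one postings traversal counts how many distinct query
    # terms each doc contains; only docs reaching the threshold are scored.
    counts = Counter()
    for t in set(query_terms):
        for doc_id in postings.get(t, ()):
            counts[doc_id] += 1
    return {d for d, c in counts.items() if c >= min_match}

def build_champion_lists(weights_by_term, r=50):
    # Idea 3: per term, precompute the r docs with the highest weights;
    # at query time, score only the union of the query terms' lists.
    return {t: sorted(docs, key=docs.get, reverse=True)[:r]
            for t, docs in weights_by_term.items()}
```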
Goal
Know the basic ideas of Boolean and vector space models
Know the advantages and limitations of the exact match model and the best match model
Familiar with the term weighting functions and similarity methods for vector space model
Familiar with some implementation considerations