3 [IR] Index Construction and Compression (Query Processing and doc…

- - - - based on dictionary
      - based on search results
    - - Resources
        
        based on index terms in the collection
        
        based on query log
        
        based on dictionary
      - Methods
        
        edit distance
        
        n-gram overlap
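
The two methods in this branch can be sketched in a few lines. This is a minimal illustration, assuming plain Levenshtein distance for "edit distance" and Jaccard similarity over character bigrams for "n-gram overlap"; the function names and the tiny vocabulary are hypothetical, not taken from the notes.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def char_ngrams(word, n=2):
    """Set of character n-grams of a word (no padding, an assumption)."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def ngram_overlap(a, b, n=2):
    """Jaccard similarity between the n-gram sets of two words."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga and not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


def suggest(word, vocabulary):
    """Pick the vocabulary entry closest to word by edit distance."""
    return min(vocabulary, key=lambda term: edit_distance(word, term))


# Hypothetical vocabulary, for illustration only.
print(suggest("reaults", ["results", "resources", "retrieval"]))
print(ngram_overlap("november", "december"))
```

In practice the n-gram overlap is often used as a cheap filter to shortlist candidates, with edit distance computed only on the shortlist; that pairing is a common design choice rather than something these notes specify.
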
- - - - create a vocabulary list, a dictionary like structure
    - - an array telling which docs contain the terms
        
        Problem
        
        no quick
        
        term-document matrix is very sparse
        
        no information about which terms are more useful than others
    - - an array of only the docs that contain the terms
        
        no position information for proximity queries
        
        no information about which terms are more useful than others
    - - Heaps' law: estimate the vocabulary size
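
The structures in this branch can be sketched as follows: a postings list that stores only the documents containing each term, a positional variant that addresses the missing position information, and Heaps' law in its usual form V = k * N^beta. This is a sketch under stated assumptions: tokenization is a plain whitespace split, document IDs are list positions, and the `k` and `beta` values in the example are illustrative, not measured.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to a sorted list of IDs of the docs containing it."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        # set() so a term repeated within one doc is recorded once
        for term in set(text.lower().split()):
            index[term].append(doc_id)
    return dict(index)


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]} to support proximity queries."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return dict(index)


def heaps_vocabulary_size(num_tokens, k, beta):
    """Heaps' law estimate of vocabulary size: V = k * N**beta."""
    return k * num_tokens ** beta


# Hypothetical three-document collection, for illustration only.
docs = ["new home sales", "home sales rise", "new rise"]
print(build_inverted_index(docs)["home"])
print(build_positional_index(docs)["sales"])
print(heaps_vocabulary_size(10_000, k=10, beta=0.5))  # illustrative parameters
```

Because documents are visited in increasing ID order, each postings list comes out sorted, which is what the merge-based Boolean operators and gap compression rely on.
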
- - - - Distribution of numbers is skewed
      - The longest lists take the most space, so compressing them saves the most space
    - - Delta encoding
      - Unary code
      - Gamma code
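
A small sketch of the three codes in this branch. It reads "delta encoding" as storing the gaps between successive document IDs (an assumption; the notes do not define it), and it uses the common textbook convention where the unary code of n is n ones followed by a zero. Some texts flip the bits, so treat the exact bit patterns as convention-dependent.

```python
def to_gaps(doc_ids):
    """Replace sorted doc IDs by the first ID followed by successive gaps."""
    gaps, prev = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def from_gaps(gaps):
    """Inverse of to_gaps: running sum restores the doc IDs."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


def unary(n):
    """Unary code: n ones terminated by a zero."""
    return "1" * n + "0"


def gamma_encode(n):
    """Elias gamma code for n >= 1: unary(length of offset) + offset,
    where offset is n in binary with the leading 1 removed."""
    if n < 1:
        raise ValueError("gamma code is defined for positive integers")
    offset = bin(n)[3:]  # bin(13) == '0b1101' -> '101'
    return unary(len(offset)) + offset


def gamma_decode(bits):
    """Decode a concatenation of gamma codes back into integers."""
    numbers, i = [], 0
    while i < len(bits):
        length = 0
        while bits[i] == "1":  # read the unary length prefix
            length += 1
            i += 1
        i += 1  # skip the terminating zero
        numbers.append(int("1" + bits[i:i + length], 2))
        i += length
    return numbers


postings = [3, 7, 8, 20]  # hypothetical postings list
gaps = to_gaps(postings)
encoded = "".join(gamma_encode(g) for g in gaps)
print(gaps, encoded, from_gaps(gamma_decode(encoded)))
```

Small gaps get short codes, which is why the skewed distribution of gap values makes variable-length codes pay off on long postings lists.
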
- - - - Operator: AND / OR
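
One way the AND and OR operators could be evaluated is a linear merge over two sorted postings lists. The sketch below assumes each list holds unique document IDs in increasing order; the function names and sample lists are hypothetical.

```python
def intersect(p1, p2):
    """AND: doc IDs present in both sorted postings lists."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance the list with the smaller ID
        else:
            j += 1
    return result


def union(p1, p2):
    """OR: doc IDs present in either sorted postings list, without duplicates."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            result.append(p1[i])
            i += 1
        else:
            result.append(p2[j])
            j += 1
    # one list is exhausted; the remainder of the other is already sorted
    result.extend(p1[i:])
    result.extend(p2[j:])
    return result


a = [1, 2, 4, 11, 31]  # hypothetical postings for one term
b = [2, 31, 54]        # hypothetical postings for another term
print(intersect(a, b))
print(union(a, b))
```

Both merges take time proportional to the combined length of the two lists, which is why keeping postings sorted by document ID matters.
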
- - - - Stopwords account for a large fraction of text (about 50%), so eliminating them greatly reduces inverted-index storage costs
    - - For most words, gathering sufficient data for meaningful statistical analysis is difficult since they are extremely rare
  - - - Term usage is highly skewed, but in a locally predictable pattern
      - why is this important?
        
        optimization of data structure
        
        statistical retrieval algorithms depend on them
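
The skew described in this branch can be checked on any token stream by ranking terms by frequency and measuring how much of the text the top few terms cover. The sketch below is only an illustration: the helper names are hypothetical and the sample sentence is invented, so its numbers say nothing about real collections.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, term, count) triples, most frequent term first."""
    counts = Counter(tokens)
    return [(rank, term, count)
            for rank, (term, count) in enumerate(counts.most_common(), 1)]


def top_k_coverage(tokens, k):
    """Fraction of all token occurrences accounted for by the k most
    frequent terms (the candidates for a stopword list)."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(k))
    return covered / len(tokens)


# Invented toy text, for illustration only.
tokens = "the cat and the dog and the bird".split()
print(rank_frequency(tokens)[:2])
print(top_k_coverage(tokens, k=2))
```

On a real collection, the same two functions would show a few very frequent terms covering a large share of the text and a long tail of terms that each occur only a handful of times.
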
- - - - better data structure