Please enable JavaScript.
Coggle requires JavaScript to display documents.
3 [IR] Index Construction and Compression (Query Processing and doc…
3 [IR] Index Construction and Compression
Query Processing and doc indexing
Perform the exact steps as in indexing
wildcard queries
Stemming helps in coping with morphological variations
retrieval system has to provide wildcard query capabilities
Permuterm indexes for wildcard queries: $
N-gram for wildcard queries
Spelling check
check misspell
based on dictionary
based on search reaults
make correction suggestions
Resources
based on index terms in the collection
based on query log
based on dictionary
Methods
edited distance
n-gram overlapping
Index Construction
Features
given a term, all related information needs to be easily accessible
create a vocabulary list, a dictionary like structure
Which documents contains a term need to be available
an array telling which docs contain the terms
Problem
no quick
term-document matrix is very sparse
no information about some terms are more useful than others
posting: avoid data sparse and explicitly indicate docs
an array of only the docs contain the terms
no position information for proximate queries
no information about some terms are more useful than others
a more elaborated inverted index!!!! >> posting (id: pos1, pos2, pos3)
Heap's law: calculate the vocabulary size
Index Compression
Compress Posting Lists
Opportunities
Distribution of numbers is skewed
The longest lists take the most space, compressing them save most space
Solution
Delta encoding
Unary code
Gamma code
Class Goals
basic idea of constructing index
Know the reasons why index is constructed into such inverted format with various component
know the basics of index compression
Can design simple index construction and compression modules
Query languages
Two types of languages
Boolean or structured
Free text or bag-of-words queries
Operator: AND / OR
Many systems support a combination of both
Zipf's law
Zipf's law related a term's frequency to its rank
It's quite accurate except for very high and row rank
Impact on IR
Good
stopword will account for a large fraction of text so eliminating them greatly reduce inverted-index storage costs >> take up 50% of the text
Bad
For most words, gathering sufficient data for meaningful statistical analysis is difficult since they are extremely rare
Statistical properties of text
summary
Term usage is highly skewed, but in a locally predictable patten
important?
optimization of data structure
statistical retrieval algorithms depend on them
Postings
size: 10% of the size of the documents
20% of collection size
enormous for proximity operators when position is also needed
sometimes larger than the document collection
Hardware issues
data size and speed
speed: External storage < Hard drive < main memory < CPU chces
External storage > Hard driver > Main memory > CPU caches
2
OS often read/write entire blocks of data
data trandsfer between disk and memory is handled by bus
3
Size of Index could be too big to be stored in RAM
Operations needed to look for a term in term list can be huge
better data structure
Solutions:
Decoupling
Access methods to inverted file
enable accurate and efficient document retrieval
Organize the term list with efficient access mehtoda
B-tree
Hash table
efficient use of RAM