Information Retrieval
2
Indexing
Process of converting documents
into an indexable representation
Indexes
Generate a set of useful terms
Process
Stop word removal
Stemming
Tokenization
Issues
Apostrophes can be a part of a word
- 1890's
Numbers can be important
- RM 15
Capitalized words can have different meaning
- Bush fire vs bush fire
Abbreviations
- Ph.D
Phrases
- POS tagging
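The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration, not a production pipeline: the stop-word list, the token pattern, and the names `tokenize`, `naive_stem`, and `preprocess` are all assumptions made for this example, and the suffix stripper is a crude stand-in for a real stemmer such as Porter's.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "is", "to"}

# Keeps internal apostrophes and periods so "1890's" and "Ph.D" stay whole.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:['.][A-Za-z0-9]+)*")


def tokenize(text):
    """Split text into tokens, preserving case and numbers."""
    return TOKEN_RE.findall(text)


def naive_stem(token):
    """Crude suffix stripping for illustration only (not the Porter stemmer)."""
    if not token.isalpha():
        return token  # leave tokens such as "1890's" or "15" untouched
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, lowercase, drop stop words, then stem."""
    tokens = [t.lower() for t in tokenize(text)]
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]
```

Note that lowercasing in `preprocess` discards the capitalization distinction mentioned above (Bush fire vs bush fire); a system that needs it would have to skip or refine that step.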
Inverted file
- Stores an inverted index with
term counts/positions
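One possible layout for an inverted file is sketched below, assuming documents have already been tokenized. The function name `build_inverted_index` and the nested-dictionary structure are illustrative choices; each term maps to the documents that contain it, with the positions stored per document so that the count is simply the length of the position list.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]}; the count is len(positions)."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


# Toy collection with made-up document ids.
docs = {
    1: ["bush", "fire", "season"],
    2: ["fire", "alarm", "fire"],
}
index = build_inverted_index(docs)
```

Here `index["fire"]` lists both documents, and `len(index["fire"][2])` gives the count of "fire" in document 2.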
Retrieval
Function
Documents are retrieved according
to a score computed
using a ranking algorithm
Vector Space Model
Boolean Retrieval
Weight=tf.idf
Term Frequency
Measures term importance in a document
\( tf_{t,d} = \frac{f_{t,d}}{\sum^{T}_{j=1}f_{j,d}} \)
, where
\( f_{t,d} \) is the number of occurrences of term t in document d
\( f_{j,d} \) is the number of occurrences of term j in document d
T is the number of distinct terms
Inverse Document Frequency
Measures term importance in the collection
\( idf_t = \log \frac{N}{N_t} \)
, where
\( N_t \) is the number of documents containing term t
N is the total number of documents in the collection
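The two weights can be combined in a few lines, as in the sketch below. It assumes documents are given as token lists and uses the natural logarithm, since the notes do not fix a base; the function names are hypothetical. Term frequency is normalized by the total number of term occurrences in the document.

```python
import math
from collections import Counter


def term_frequencies(tokens):
    """tf: count of each term divided by the total number of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def inverse_document_frequencies(docs):
    """idf_t = log(N / N_t): N documents, N_t of them containing term t."""
    n_docs = len(docs)
    doc_freq = Counter(term for tokens in docs for term in set(tokens))
    return {term: math.log(n_docs / n_t) for term, n_t in doc_freq.items()}


def tfidf_vectors(docs):
    """One sparse {term: tf * idf} vector per document."""
    idf = inverse_document_frequencies(docs)
    return [
        {term: tf * idf[term] for term, tf in term_frequencies(tokens).items()}
        for tokens in docs
    ]


# Toy collection, invented for this example.
docs = [["fire", "alarm", "fire"], ["bush", "fire"], ["bush", "walk"]]
weights = tfidf_vectors(docs)
```

A term that occurs in every document gets idf = log(1) = 0, so it contributes nothing to the weight regardless of how often it appears.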
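Similarity Measure
\( \cos(D_d, Q) = \frac{\sum^{T}_{t=1}D_{d,t}\cdot Q_t}{\sqrt{\sum^{T}_{t=1}(D_{d,t})^2 \sum^{T}_{t=1}(Q_{t})^2}} \)
, where
\( D_{d,t} \) is the weight of term t in document d
\( Q_t \) is the weight of term t in query Q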
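The cosine measure can be sketched over sparse vectors stored as `{term: weight}` dictionaries. The names `cosine_similarity` and `rank` are assumptions for this example, and returning 0.0 for an all-zero vector is a convention chosen here, not something the notes specify.

```python
import math


def cosine_similarity(doc_weights, query_weights):
    """Cosine between two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * query_weights.get(t, 0.0) for t, w in doc_weights.items())
    doc_norm = math.sqrt(sum(w * w for w in doc_weights.values()))
    query_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    if doc_norm == 0.0 or query_norm == 0.0:
        return 0.0  # convention chosen here for empty or all-zero vectors
    return dot / (doc_norm * query_norm)


def rank(documents, query_weights):
    """Return (doc_id, score) pairs sorted by descending score."""
    scores = {d: cosine_similarity(w, query_weights) for d, w in documents.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Because the score is normalized by both vector lengths, a long document does not outrank a short one merely by containing more terms.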
Evaluation
Evaluates the retrieval function and preprocessing steps
Performance Metric
Precision
- \( P = \frac{TP}{TP + FP}\)
Recall
- \(R = \frac{TP}{TP + FN} \)
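Both metrics can be computed from the set of retrieved documents and the set of relevant documents, as in this small sketch. The function name `precision_recall` is hypothetical, and returning 0.0 when a denominator would be zero is an assumption made here. True positives are retrieved documents that are relevant, false positives are retrieved but not relevant, and false negatives are relevant but not retrieved.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)   # retrieved and relevant
    fp = len(retrieved - relevant)   # retrieved but not relevant
    fn = len(relevant - retrieved)   # relevant but missed
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall
```

For example, retrieving documents 1, 2, 3, 4 when documents 2, 4, 5 are relevant gives TP = 2, FP = 2, FN = 1, so precision is 2/4 and recall is 2/3.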
Advantages (Boolean retrieval)
- Results are predictable and easy to explain
- Efficient processing, since many documents can be eliminated early
Disadvantages (Boolean retrieval)
- Simple queries do not work well
- Complex queries are difficult to formulate
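Boolean retrieval itself can be sketched as set operations over an inverted index. The example assumes a simplified index that maps each term to a set of document ids; the helper names and the toy data are invented for illustration.

```python
def boolean_and(index, *terms):
    """Documents containing every one of the given terms."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def boolean_or(index, *terms):
    """Documents containing at least one of the given terms."""
    return set().union(*(index.get(t, set()) for t in terms))


def boolean_not(index, all_docs, term):
    """Documents that do not contain the term."""
    return set(all_docs) - index.get(term, set())


# Toy index: term -> set of document ids (made up for this example).
index = {"bush": {1, 3}, "fire": {1, 2}, "alarm": {2}}
all_docs = {1, 2, 3}

# Query: fire AND NOT alarm
result = boolean_and(index, "fire") & boolean_not(index, all_docs, "alarm")
```

Every document either matches or does not, which is why results are predictable but unranked; an AND step can discard most of the collection before any further work is done.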