Information Retrieval
2

Indexing

Process of converting each document
into a representation suitable for retrieval

Indexes
Generate a set of useful terms

Process

Stop word removal

Stemming

Tokenization

Issues

Apostrophes can be a part of a word

  • 1890's

Numbers can be important

  • RM 15

Capitalized words can have different meanings

  • Bush fire vs bush fire

Abbreviations

  • Ph.D

Phrases

  • POS tagging
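
The three indexing steps and some of the tokenization issues above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the stop-word list, the token pattern and the `crude_stem` rule are all assumptions made for the example (a real system would typically use a proper stemmer such as Porter's). Case is preserved so that "Bush" and "bush" stay distinct.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to", "was"}

# A token is a run of letters/digits that may contain internal apostrophes
# or periods, so "1890's" and "Ph.D" survive as single tokens.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:['.][A-Za-z0-9]+)*")


def tokenize(text):
    """Split text into tokens, keeping numbers, apostrophes and abbreviations."""
    return TOKEN_PATTERN.findall(text)


def remove_stop_words(tokens):
    """Drop tokens whose lowercase form is in the stop-word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def crude_stem(token):
    """Strip a few common suffixes from purely alphabetic tokens (toy rule)."""
    if not token.isalpha():
        return token
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def index_terms(text):
    """Tokenize, remove stop words, then stem."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]


print(index_terms("Bush fires burning in the 1890's cost RM 15"))
# ['Bush', 'fire', 'burn', "1890's", 'cost', 'RM', '15']
```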

Inverted file

  • Stores the inverted index with
    term counts and/or positions
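
A positional inverted index can be sketched as a mapping from each term to the documents that contain it, together with the positions where it occurs. The function name `build_inverted_index` and the sample documents are assumptions for illustration; the count of a term in a document is simply the length of its position list.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]} for already-tokenized documents."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for position, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(position)
    return index


# Hypothetical sample collection.
docs = {
    "d1": ["bush", "fire", "season"],
    "d2": ["fire", "alarm", "fire"],
}
index = build_inverted_index(docs)

print(index["fire"])             # {'d1': [1], 'd2': [0, 2]}
print(len(index["fire"]["d2"]))  # 2 (count of "fire" in d2)
```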

Retrieval Function

Documents are retrieved according
to a score computed
using a ranking algorithm

Vector Space Model

Boolean Retrieval

Weight = tf × idf

Inverse Document Frequency

Term Frequency

Measures term importance within a document

\( tf_{t,d} = \frac{f_{t,d}}{\sum^{T}_{j=1}f_{j,d}} \)
, where
\( f_{t,d} \) is the number of occurrences of term t in document d
\( f_{j,d} \) is the number of occurrences of term j in document d
T is the number of distinct terms
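
As a small worked sketch of the term-frequency formula, the count of each term is divided by the total number of term occurrences in the document. The function name `term_frequencies` and the sample tokens are assumptions for illustration.

```python
from collections import Counter


def term_frequencies(tokens):
    """Return the normalized frequency of each term in one tokenized document."""
    counts = Counter(tokens)
    total = sum(counts.values())  # total number of term occurrences
    return {term: count / total for term, count in counts.items()}


print(term_frequencies(["fire", "alarm", "fire", "bell"]))
# {'fire': 0.5, 'alarm': 0.25, 'bell': 0.25}
```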

Measures term importance in the collection

\(idf_t = \log \frac{N}{N_t}\)
, where
\(N_t \) is the number of documents containing term t
N is the total number of documents in the collection
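
The idf formula and the tf × idf weight can be sketched as follows. The function names and the two-document collection are assumptions for illustration, and the natural logarithm is assumed because the notes do not state a base. A term that appears in every document gets an idf of 0, so it contributes nothing to the weight.

```python
import math


def inverse_document_frequencies(docs):
    """Return idf_t = log(N / N_t) for every term in a tokenized collection."""
    n_docs = len(docs)
    doc_counts = {}
    for tokens in docs.values():
        for term in set(tokens):  # count each document at most once per term
            doc_counts[term] = doc_counts.get(term, 0) + 1
    return {term: math.log(n_docs / n_t) for term, n_t in doc_counts.items()}


def tf_idf_weights(tokens, idf):
    """Weight each term in one document by tf * idf."""
    total = len(tokens)
    return {
        term: (tokens.count(term) / total) * idf.get(term, 0.0)
        for term in set(tokens)
    }


# Hypothetical sample collection.
docs = {"d1": ["fire", "bush"], "d2": ["fire", "alarm"]}
idf = inverse_document_frequencies(docs)
weights = tf_idf_weights(docs["d1"], idf)
print(idf["fire"])      # 0.0, since "fire" occurs in every document
print(weights["bush"])  # 0.5 * log(2)
```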

Similarity Measure

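\(\cos(D_d, Q) = \frac{\sum^{T}_{t=1}D_{d,t}\cdot Q_t}{\sqrt{\sum^{T}_{t=1}(D_{d,t})^2 \sum^{T}_{t=1}(Q_{t})^2}} \)
, where
\( D_{d,t} \) is the weight of term t in document d
\( Q_t \) is the weight of term t in the query
T is the number of distinct terms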
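
The cosine measure can be sketched over sparse term-weight dictionaries, where a missing term counts as weight 0. The function name `cosine_similarity` and the sample weights are assumptions for illustration; returning 0.0 for an all-zero vector is a design choice made here to avoid division by zero.

```python
import math


def cosine_similarity(doc_weights, query_weights):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * query_weights.get(t, 0.0) for t, w in doc_weights.items())
    doc_norm = math.sqrt(sum(w * w for w in doc_weights.values()))
    query_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    if doc_norm == 0 or query_norm == 0:
        return 0.0  # assumption: an empty vector matches nothing
    return dot / (doc_norm * query_norm)


# Hypothetical weights: dot = 3, norms = 5 and 1, so the score is about 0.6.
score = cosine_similarity({"fire": 3.0, "alarm": 4.0}, {"fire": 1.0})
```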

Evaluation

Evaluate the retrieval function and the preprocessing steps

Performance Metric

Precision

  • \( P = \frac{TP}{TP + FP}\)

Recall

  • \(R = \frac{TP}{TP + FN} \)
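
Both metrics can be computed from the set of retrieved documents and the set of relevant documents: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The function name `precision_recall` and the document ids are assumptions for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)  # relevant and retrieved
    fp = len(retrieved - relevant)  # retrieved but not relevant
    fn = len(relevant - retrieved)  # relevant but missed
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall


# Hypothetical result list: 2 of 4 retrieved are relevant, 2 of 3 relevant found.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p)  # 0.5
```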

Advantages (Boolean retrieval)

  • Results are predictable and easy to explain
  • Processing is efficient, since many documents can be eliminated early

Disadvantages (Boolean retrieval)

  • Simple queries do not work well
  • Complex queries are difficult to formulate
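
The predictability and efficiency noted for Boolean retrieval can be illustrated with an AND query over posting sets: a document either contains every query term or it is eliminated. This is a minimal sketch; the function name `boolean_and` and the sample index are assumptions for illustration.

```python
def boolean_and(index, terms):
    """Return the ids of documents that contain every query term."""
    postings = [index.get(term, set()) for term in terms]
    if not postings:
        return set()
    # Start from the shortest posting set so most documents are eliminated early.
    postings.sort(key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result


# Hypothetical index mapping each term to the set of documents containing it.
index = {
    "bush": {"d1", "d3"},
    "fire": {"d1", "d2", "d3"},
    "alarm": {"d2"},
}
print(boolean_and(index, ["fire", "alarm"]))  # {'d2'}
```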