Information Retrieval 2


Indexing

Process of converting a document
into an index representation

Indexes
Generate a set of useful terms

Process

Stop word removal

Stemming

Tokenization

Issues

Apostrophes can be a part of a word

Numbers can be important

Capitalized words can have different meaning

Abbreviations

Phrases
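
A minimal sketch of the three processing steps, assuming English text. The stop-word list, the token pattern, and the suffix-stripping `naive_stem` are hypothetical stand-ins chosen for illustration, not a real stemmer. The pattern keeps internal apostrophes and numbers inside tokens; lowercasing, however, merges capitalized words such as "US" with "us", which is exactly the kind of issue listed above.

```python
import re

# Hypothetical, tiny stop-word list; real systems use much larger ones.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "to", "and"}

# Keeps internal apostrophes (o'neill) and numbers (2024) inside a token.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*")


def tokenize(text):
    """Split text into lowercase tokens."""
    return [m.group(0).lower() for m in TOKEN_RE.finditer(text)]


def naive_stem(token):
    """Strip a few common English suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def index_terms(text):
    """Tokenize, drop stop words, then stem what is left."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


print(tokenize("O'Neill's US"))
print(index_terms("The indexing of documents in 2024"))
```

Abbreviations and multi-word phrases are not handled here; they would need extra rules or a dictionary.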

Inverted file
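
An inverted file maps each term to the documents that contain it. A minimal sketch, assuming each document is already reduced to a list of index terms; the function name `build_inverted_index` and the sample documents are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents that contain it.

    `docs` is a dict of {doc_id: list of index terms}.
    """
    postings = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


docs = {
    1: ["index", "document"],
    2: ["rank", "document"],
    3: ["index", "rank", "query"],
}
index = build_inverted_index(docs)
print(index["document"])
```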

Retrieval
Function

Documents are retrieved according
to a score computed
by a ranking algorithm

Vector Space
Model

Boolean Retrieval
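
Boolean retrieval can be sketched as set operations on posting lists: a document either matches the query or it does not, so there is no ranking. The postings and helper names below are hypothetical.

```python
# Hypothetical postings: term -> set of ids of documents containing it.
postings = {
    "index": {1, 3},
    "rank": {2, 3},
    "document": {1, 2},
}
all_docs = {1, 2, 3}


def boolean_and(a, b):
    """Documents containing both terms."""
    return postings.get(a, set()) & postings.get(b, set())


def boolean_or(a, b):
    """Documents containing at least one of the terms."""
    return postings.get(a, set()) | postings.get(b, set())


def boolean_not(a):
    """Documents that do not contain the term."""
    return all_docs - postings.get(a, set())


print(boolean_and("index", "rank"))
```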

$Weight_{t,d} = tf_{t,d} \cdot idf_t$

Inverse Document
Frequency

Term
Frequency

Measures term importance in a document

$ tf_{t,d} = \frac{f_{t,d}}{\sum^{T}_{j=1}f_{j,d}} $
, where
$ f_{t,d} $ is the number of occurrences of term t in document d
$ f_{j,d} $ is the number of occurrences of term j in document d
T is the number of distinct terms

Measures term importance in the collection

$idf_t = \log \frac{N}{N_t}$
, where
$N_t$ is the number of documents that contain term t
N is the total number of documents in the collection
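
The two formulas above can be combined into a term weight per document. This is a sketch only: the notes do not state the logarithm base, so the natural logarithm is assumed here, and the function names and sample documents are hypothetical.

```python
import math
from collections import Counter


def term_frequencies(doc_terms):
    """tf: count of each term divided by the total number of term occurrences."""
    counts = Counter(doc_terms)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}


def inverse_document_frequencies(docs):
    """idf_t = log(N / N_t); natural log is an assumption."""
    n_docs = len(docs)
    doc_freq = Counter()
    for terms in docs:
        doc_freq.update(set(terms))  # count each term once per document
    return {t: math.log(n_docs / n_t) for t, n_t in doc_freq.items()}


def tf_idf(docs):
    """Return one {term: weight} dict per document."""
    idf = inverse_document_frequencies(docs)
    return [
        {t: tf * idf[t] for t, tf in term_frequencies(d).items()}
        for d in docs
    ]


docs = [
    ["index", "document", "document"],
    ["rank", "document"],
    ["index", "rank", "query", "query"],
]
weights = tf_idf(docs)
print(weights[2]["query"])
```

A term that occurs in every document gets idf = log(1) = 0, so it contributes nothing to the weight.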

Similarity
Measure

$cos(D_d, Q) = \frac{\sum^{T}_{t=1}D_{d,t}\cdot Q_t}{\sqrt{\sum^{T}_{t=1}(D_{d,t})^2 \sum^{T}_{t=1}(Q_{t})^2}} $
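
A minimal sketch of the cosine measure, assuming document and query vectors are stored sparsely as `{term: weight}` dicts. Returning 0.0 for an all-zero vector is a convention chosen here, not something the formula defines.

```python
import math


def cosine_similarity(doc_vec, query_vec):
    """Cosine of the angle between two sparse vectors stored as {term: weight}."""
    dot = sum(w * query_vec.get(t, 0.0) for t, w in doc_vec.items())
    doc_norm = math.sqrt(sum(w * w for w in doc_vec.values()))
    query_norm = math.sqrt(sum(w * w for w in query_vec.values()))
    if doc_norm == 0.0 or query_norm == 0.0:
        return 0.0  # convention chosen here for empty or all-zero vectors
    return dot / (doc_norm * query_norm)


print(cosine_similarity({"index": 3.0, "rank": 4.0}, {"index": 1.0}))
```

Because the score is normalized by both vector lengths, a long document is not favored just for containing more terms.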

Evaluation

Evaluates the retrieval function and the preprocessing steps

Performance Metrics

Precision

Recall
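
The notes do not spell out the two metrics, so the sketch below uses their standard definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The function name and the sample ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


precision, recall = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6])
print(precision, recall)
```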

Advantages

Disadvantages