6 [IR] Evaluation of IR systems
Cranfield method and test collection
Background
The retrieval process is complicated, with many decision points
During indexing of documents
During processing of queries
During matching queries against documents
evaluation becomes a central issue in IR
All novel techniques need to demonstrate superior performance on representative document collections
Basics
Early and influential studies on IR effectiveness
Goal was to compare human and automatic indexing methods
Basics of the 1967 experiment
Importance: experimental methodology influential even today
A test collection consists of a document collection, a set of information needs (search requests), and the ground truth relevance judgements between the documents and the needs
steps
1. Obtain a test collection
Obtain a corpus of documents
Obtain a set of search requests
Obtain the ground truth judgements
2. Conduct retrieval using different experimental conditions
Obtain several search results
3. Measure the effectiveness of the searches by comparing the relevant documents in the ground truth with those in the search results
To truly compare different retrieval algorithms, one may need multiple test collections
Evaluation measures
Statistical testing
Goals
Know the basic ideas of the Cranfield evaluation framework
Know the advantages and limitations of the measures, and be able to calculate mean average precision
Be familiar with running an evaluation on a retrieval system using the Cranfield evaluation framework
How
How to collect relevance judgements
In a test collection, the most expensive part is the ground truth
The quality of relevance judgements can affect the evaluation results
manually
solutions
search-guided relevance assessment
Known-item judgments
Pooling
How to find relevant documents
search-guided relevance assessment
Iterate between topic research/ search/ assessment
Known-item judgments have the lowest cost
Tailor queries to retrieve a single known document
Useful as a first cut to see if a new technique is viable
Pooling Method
Use the top n documents from each search result to build a pool for judgment (a small sketch follows this block)
single pool, duplicates removed, arbitrary order
Judged by the person who developed the topic
The relevant document set is the union of all relevant documents from each result
Treat unevaluated documents as not relevant
To make pooling work
Systems must do reasonably well
Systems must not all "do the same thing"
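A minimal sketch of building a judgment pool, assuming each run is a ranked list of document IDs; the names `runs` and `pool_depth` are illustrative, not from the original notes.

```python
# A minimal sketch of the pooling method: union the top-n documents of every
# run, remove duplicates, and hand the pool to the assessor in arbitrary order.

def build_pool(runs, pool_depth=100):
    pool = set()
    for ranked_list in runs:
        pool.update(ranked_list[:pool_depth])
    # Sorted (i.e. arbitrary, non-system) order so the assessor is not biased
    # by any single system's ranking.
    return sorted(pool)

# Example with three toy runs and a pool depth of 3
runs = [["d1", "d2", "d3", "d9"],
        ["d2", "d4", "d1", "d7"],
        ["d5", "d1", "d6", "d8"]]
print(build_pool(runs, pool_depth=3))   # ['d1', 'd2', 'd3', 'd4', 'd5', 'd6']
```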
Does pooling work?
Kappa measure
for inter-judge agreement
agreement measure among judges
Designed for categorical judgements
Corrects for chance agreement
kappa = [P(A) - P(E)] / [1 - P(E)]
P(A): proportion of the time the judges agree
P(E): the agreement expected by chance
kappa = 0 for chance agreement, 1 for total agreement
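A minimal sketch of computing the kappa measure for two judges with binary relevance labels, directly from the P(A) and P(E) definitions above; the list names are illustrative.

```python
# Cohen's kappa for two judges; `judgements_a` and `judgements_b` are
# hypothetical lists of binary relevance labels (1 = relevant, 0 = not).

def cohens_kappa(judgements_a, judgements_b):
    n = len(judgements_a)
    # P(A): proportion of documents on which the two judges agree
    p_a = sum(a == b for a, b in zip(judgements_a, judgements_b)) / n
    # P(E): agreement expected by chance, from each judge's label proportions
    p_rel_a = sum(judgements_a) / n
    p_rel_b = sum(judgements_b) / n
    p_e = p_rel_a * p_rel_b + (1 - p_rel_a) * (1 - p_rel_b)
    return (p_a - p_e) / (1 - p_e)

# Example: 10 documents, the judges disagree on two of them -> kappa = 0.6
print(cohens_kappa([1, 1, 0, 0, 1, 0, 1, 1, 0, 0],
                   [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]))
```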
Evaluation Measures
Good effectiveness measure
should capture some aspect of what the user wants
Should have predictive value for other situations
should be easily replicated by other researchers
should be easily comparable
Many measures
View search results as a set: precision, recall, F measure
View search results as a ranked list: mean average precision, NDCG
Set-based basic measures
Precision, recall, miss, false alarm
Precision and recall
should be viewed together (see the sketch after this block)
precision is the bullseye or the needle in the haystack
recall is the kitchen sink
problematic if we consider only one of them
a perfect-precision retrieval system (e.g., return only a single document that is certain to be relevant)
a perfect-recall retrieval system (e.g., return every document in the collection)
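A minimal sketch of the set-based measures (precision, recall, and F as their harmonic mean), assuming the retrieved and relevant results are given as sets of document IDs; the names are illustrative.

```python
# Set-based measures: precision, recall, and F1 (harmonic mean of the two).

def precision_recall_f1(retrieved, relevant):
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1

# Example: 3 of the 4 retrieved documents are relevant; 6 relevant docs exist
print(precision_recall_f1({"d1", "d2", "d3", "d4"},
                          {"d1", "d2", "d3", "d5", "d6", "d7"}))
# (0.75, 0.5, 0.6)
```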
Measuring
Average Precision
Mean Average Precision
F measure as Harmonic Mean
whether a returned document is relevant or not matters, but the rank at which each relevant document appears is also important (see the sketch below)
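A minimal sketch of average precision and MAP under the usual definition (average the precision at each relevant document's rank, then take the mean over queries); the variable names are illustrative.

```python
# Average precision for one ranked list, and MAP over a set of queries.

def average_precision(ranking, relevant):
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank      # precision at each relevant hit
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(rankings, relevants):
    # MAP: mean of the per-query average precision values
    return sum(average_precision(r, rel)
               for r, rel in zip(rankings, relevants)) / len(rankings)

# Example: relevant docs found at ranks 1 and 3, out of 2 relevant in total
print(average_precision(["d1", "d5", "d2", "d6"], {"d1", "d2"}))  # (1/1 + 2/3)/2 ≈ 0.83
```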
R-precision
precision at the R-th position in the ranking of results for a query that we know has R relevant documents in the collection
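A minimal sketch of R-precision under that definition; `ranking` and `relevant` are illustrative names.

```python
# R-precision: precision within the top-R results, where R = |relevant|.

def r_precision(ranking, relevant):
    r = len(relevant)
    return len(set(ranking[:r]) & relevant) / r if r else 0.0

# Example: 2 of the top-3 results are relevant, R = 3
print(r_precision(["d1", "d4", "d2", "d3"], {"d1", "d2", "d3"}))  # ≈ 0.67
```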
Mean Reciprocal Rank (MRR)
MRR is the mean of the reciprocal ranks over a set of queries
MRR only cares about the rank of the first relevant result (see the sketch below)
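A minimal sketch of MRR, taking the reciprocal rank of the first relevant result per query and averaging over the queries; names are illustrative.

```python
# Reciprocal rank for one query, and MRR over a set of queries.

def reciprocal_rank(ranking, relevant):
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            return 1.0 / rank          # only the first relevant hit counts
    return 0.0

def mean_reciprocal_rank(rankings, relevants):
    return sum(reciprocal_rank(r, rel)
               for r, rel in zip(rankings, relevants)) / len(rankings)

# Example: first relevant document at ranks 1 and 2 -> MRR = (1 + 0.5) / 2
print(mean_reciprocal_rank([["d1", "d2"], ["d3", "d4"]],
                           [{"d1"}, {"d4"}]))   # 0.75
```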
Cumulative Gain (CG)
the sum of the graded relevance values of all results in a search result list.
Discounted Cumulative Gain
reason: CG is not really sensitive to the ranking
evaluation of a ranked list follows two assumptions
highly relevant documents are more useful when appearing earlier in a search engine result list
highly relevant documents are more useful than marginally relevant documents, which are in turn more useful than irrelevant documents
use Discounted Cumulative Gain (DCG)
Because there is no fixed range for DCG, we need to normalize the value >> NDCG
ideal discounted cumulative gain (IDCG): the DCG of the result list re-sorted by relevance score (see the sketch below)
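A minimal sketch of DCG and NDCG, assuming graded relevance values given in ranked order and the common log2(rank+1) discount (other discount variants exist); names are illustrative.

```python
import math

# DCG discounts each graded relevance value by the log of its rank;
# NDCG divides by the ideal DCG (the same gains in the best possible order).

def dcg(gains):
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(gains, start=1))

def ndcg(gains):
    ideal = dcg(sorted(gains, reverse=True))   # ideal DCG: relevance-sorted list
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Example: graded relevance 0-3 for a ranked list of five results
print(ndcg([3, 2, 0, 1, 2]))
```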