4 [IR] Boolean and Vector Space Model
Exact Match model: Boolean model
Why Boolean Retrieval Works
Boolean operators approximate natural language
AND can discover relationships between concepts
OR can discover alternate terminology
AND NOT can remove alternate meanings
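A minimal sketch of how these operators map onto an inverted index, with postings stored as sets of document IDs (the tiny index below is invented for illustration):
```python
# Boolean operators as set operations on postings (hypothetical index).
index = {
    "party":      {1, 2, 4, 7},   # doc IDs containing each term
    "democratic": {2, 4, 9},
    "lawsuit":    {1, 7},
}

democratic_and_party = index["democratic"] & index["party"]  # AND: intersection -> {2, 4}
party_or_lawsuit     = index["party"] | index["lawsuit"]     # OR: union -> {1, 2, 4, 7}
party_not_lawsuit    = index["party"] - index["lawsuit"]     # AND NOT: difference -> {2, 4}
```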
Why Boolean Retrieval Fails
Natural language is way more complex
AND "discovers" non-existent relationships
terms in different sentences, paragraphs
especially hard in full text search
guessing terminology for OR is hard
good, nice, excellent, outstanding, awesome
guessing terms to exclude is even harder
Democratic party, party to a lawsuit
Strengths of Boolean Match Model
Strengths
Precise, if you know the right strategies
Precise, if you have an idea of what you're looking for
Efficient for the computer
Weaknesses
Limited model capacity
Boolean logic insufficient to capture the richness of language
No control over size of result set: either too many documents or none
What about partial matches? Documents that don't quite match the query may be useful also
Hard for users
Users must learn Boolean logic
Users must be familiar with the collection
When do you stop reading? All documents in the result set are considered "equally good"
Best Match model: Vector Space Model
Characteristics
Query describes a good or "best" matching document
Matching is calculated via relevance, similarity, or probability
All documents have some degree of matching to the query
Usually returns documents in ranked order of this degree
So no longer a set, but a long ranked list
But can be ranked by other criteria, such as author, date
aka "ranked retrieval"
Order documents by how likely they are to be relevant to the information need
Why Ranked Retrieval
Arranging documents by relevance is
Closer to how humans think
Closer to user behavior
Solves the feast-or-famine problem in Boolean retrieval
Best (partial) match: documents need not have all query terms
Although documents with more query terms should be "better"
With a ranked list of documents it does not matter how large the retrieval set is
Scoring
Term frequency tf
The term frequency tf(t,d) of term t in document d is defined as the number of times that t occurs in d
Relevance does not increase proportionally with term frequency
Log-frequency weighting
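A common instantiation is w(t,d) = 1 + log10(tf(t,d)) for tf > 0, else 0; a minimal sketch:
```python
import math

def log_tf_weight(tf: int) -> float:
    # Dampen raw counts: relevance should grow sublinearly with tf.
    return 1 + math.log10(tf) if tf > 0 else 0.0

# log_tf_weight(1) == 1.0, log_tf_weight(10) == 2.0, log_tf_weight(1000) == 4.0
```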
Document frequency: df
Rare terms are more informative than frequent terms
We need to define idf (inverse document frequency)
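The standard definition is idf(t) = log10(N / df(t)), where N is the total number of documents in the collection; a minimal sketch:
```python
import math

def idf(df_t: int, N: int) -> float:
    # Rare terms (small df) get large weights; a term in every doc gets 0.
    return math.log10(N / df_t)

# With N = 1_000_000: idf(1, N) == 6.0, idf(1000, N) == 3.0, idf(N, N) == 0.0
```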
tf-idf weighting
Increases with the rarity of the term in the collection
Increases with the number of occurrences within a document
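Combining the two pieces (reusing the log_tf_weight and idf helpers sketched above):
```python
def tf_idf(tf: int, df_t: int, N: int) -> float:
    # Grows with occurrences in the document (log-tf) and with
    # the rarity of the term in the collection (idf).
    return log_tf_weight(tf) * idf(df_t, N)
```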
Vector Space Model
Ranks documents by their similarity to the query
aka similarity-based model
Both documents and queries are represented as vectors
Using "inner product" to calculate similarity
Compute tf-idf weights to form the vector space
Normalize the vectors
Calculate the similarity scores and retrieve the results
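A minimal sketch of the scoring step, with documents and queries as sparse dicts of tf-idf weights (names hypothetical); dividing the inner product by both vector lengths gives the cosine:
```python
import math

def cosine(u: dict, v: dict) -> float:
    # Inner product over shared terms, normalized by both lengths.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Rank docs by descending similarity to the query (doc_vectors hypothetical):
# ranked = sorted(doc_vectors, key=lambda d: cosine(query_vec, doc_vectors[d]), reverse=True)
```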
Summary
Basic
Each dimension is a term in the vocabulary
Vector elements are real values, reflecting the importance of the terms
any vector (docs, queries...) can be compared to any other
Cosine correlation is the most often used similarity metric
Advantages
simple to implement
effective for comparing anything to anything (docs, queries, ...)
Limitations
assume independence among terms
does not explicitly model relevance
The query is just an approximation of the user's information need
Assumes that queries and documents are the same kind of object (the query is treated as a short document)
Efficiency in Vector Space Model
Efficient Cosine Ranking - I
Find the K docs in the collection "nearest" to the query > the K largest query-doc cosines
Special Case: unweighted queries
Opportunity: no weighting on query terms
Solution: assume each query term occurs only once
Then, for ranking, we don't need to normalize the query vector
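A sketch of why this shortcut is safe (helper names hypothetical): with every query weight fixed at 1, the query norm is the same constant for all documents, so omitting it cannot change the ranking:
```python
def score_unweighted(query_terms, doc_weights, doc_norm):
    # doc_weights: term -> tf-idf weight for one document;
    # doc_norm: the document vector's Euclidean length.
    # The query-side norm is constant across docs, so it is omitted.
    return sum(doc_weights.get(t, 0.0) for t in set(query_terms)) / doc_norm
```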
Efficient Cosine Ranking - II
Computing a single cosine efficiently
choosing the K largest cosine values efficiently
Key question: Can we do this without computing all N cosines?
Choosing the K largest cosines: selection vs sorting
Typically we want to retrieve the top K docs (in the cosine ranking for the query)
idea 1: Let J = number of docs with nonzero cosines > we seek the K best of these J
Use heap for selecting top K
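For instance, Python's heapq selects the K largest of J nonzero scores in roughly O(J log K) time instead of the O(J log J) of a full sort (the scores dict is invented for illustration):
```python
import heapq

scores = {"d1": 0.90, "d2": 0.30, "d3": 0.75, "d4": 0.60}  # hypothetical cosines
K = 2
top_k = heapq.nlargest(K, scores.items(), key=lambda kv: kv[1])
# [('d1', 0.9), ('d3', 0.75)]
```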
Pruning postings
idea 1: index elimination: only consider high-idf query terms
For a query such as "novel the catcher in the rye" > only accumulate scores from catcher and rye
Benefit: postings of low-idf terms have many docs > these (many) docs get eliminated from the set A of score accumulators
idea 2: index elimination: only consider docs containing many query terms
Any doc with at least one query term is a candidate for the top K output list
For multi-term queries, only compute scores for docs containing several of the query terms (see the sketch after this list)
easy to implement in postings traversal
idea 3: champion lists: precompute for each term the r docs with the highest weights, and at query time score only docs in the union of the query terms' champion lists (sketched below)
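A minimal sketch of ideas 2 and 3, assuming simple in-memory structures (postings: term -> set of doc IDs; weights_by_term: term -> {doc_id: tf-idf weight}), both hypothetical:
```python
from collections import Counter

def candidate_docs(postings, query_terms, min_match=2):
    # Idea 2: one postings traversal counts how many distinct query
    # terms each doc contains; only docs reaching the threshold are scored.
    counts = Counter()
    for t in set(query_terms):
        for doc_id in postings.get(t, ()):
            counts[doc_id] += 1
    return {d for d, c in counts.items() if c >= min_match}

def build_champion_lists(weights_by_term, r=50):
    # Idea 3: per term, precompute the r docs with the highest weights;
    # at query time, score only the union of the query terms' lists.
    return {t: sorted(docs, key=docs.get, reverse=True)[:r]
            for t, docs in weights_by_term.items()}
```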
Goal
Know the basic ideas of Boolean and vector space models
Know the advantages and limitations of the exact match model and the best match model
Familiar with the term weighting functions and similarity methods for vector space model
Familiar with some implementation considerations