5 [IR] Statistical Language Model
Efficiency in Vector Space Model
Probability basics
P(A,B) = P(A|B)P(B) = P(B|A)P(A)
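Rearranging this identity gives Bayes' rule, which the likelihood-ratio model further down relies on:

    P(A|B) = P(B|A)P(A) / P(B)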
Statistical Language Models
Unigram Language Model
Prediction: P(s|M)
using the language model to assign a probability to a text
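Under the unigram independence assumption, the probability of a text s = w1 w2 ... wn is the product of per-word probabilities:

    P(s|M) = P(w1|M) P(w2|M) ... P(wn|M)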
Estimation
observing relevant text and estimating the probability of each word
Steps
Obtain relevant text for a language model to be estimated
Perform tokenization on the collected text
Do not remove stop words
Counting
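A minimal Python sketch of these steps, assuming tokenization is just lowercasing and whitespace splitting (stop words are kept, as noted above); the function names are illustrative:

    from collections import Counter

    def estimate_unigram_lm(text):
        # Tokenize the collected relevant text; stop words are kept.
        tokens = text.lower().split()
        # Counting: maximum likelihood estimate P(w|M) = count(w) / total tokens.
        counts = Counter(tokens)
        total = len(tokens)
        return {w: c / total for w, c in counts.items()}

    def text_probability(s, model):
        # Prediction: P(s|M) as a product of unigram probabilities.
        p = 1.0
        for w in s.lower().split():
            p *= model.get(w, 0.0)  # unseen words get probability 0 without smoothing
        return p

    model = estimate_unigram_lm("the cat sat on the mat")
    print(text_probability("the cat", model))  # (2/6) * (1/6) ≈ 0.056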
Summary of various Language Models
Unigram Language Model
Bigram Language Models (sketched after this list)
Other Language Models
Grammar-based models
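For contrast with the unigram sketch above, a bigram model conditions each word on its predecessor; a minimal sketch under the same tokenization assumption (the "<s>" start marker is illustrative):

    from collections import Counter

    def estimate_bigram_lm(text):
        tokens = ["<s>"] + text.lower().split()
        # MLE: P(w_i | w_{i-1}) = count(w_{i-1} w_i) / count(w_{i-1})
        pair_counts = Counter(zip(tokens, tokens[1:]))
        context_counts = Counter(tokens[:-1])
        return {pair: c / context_counts[pair[0]] for pair, c in pair_counts.items()}

    model = estimate_bigram_lm("the cat sat on the mat")
    print(model[("the", "cat")])  # 0.5: "the" is followed by "cat" once out of two occurrences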
Multinomial / Multiple-Bernoulli
Multinomial
Can account for multiple word occurrences in the query
Widely used in many NLP areas
Possibility for integration with ASR, MT, NLP
Multiple-Bernoulli
May suit IR (directly checks the presence of query terms)
Provisions for explicit negation of query terms ("A AND NOT B")
Increasingly less popular than the multinomial method (contrast sketched below)
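A sketch contrasting the two event models; note the multiple-Bernoulli model here takes per-word presence probabilities (P(w occurs) rather than multinomial P(w|M)), an assumption for illustration:

    import math

    def multinomial_score(query_tokens, model):
        # Every occurrence of a query word contributes a factor, so
        # repeated query terms change the score (multiple occurrences count).
        return math.prod(model.get(w, 0.0) for w in query_tokens)

    def bernoulli_score(query_tokens, presence_model, vocab):
        # Each vocabulary word contributes exactly once: present in the
        # query or absent; absence uses (1 - p), which is also how
        # explicit negation ("A AND NOT B") is expressed.
        present = set(query_tokens)
        score = 1.0
        for w in vocab:
            p = presence_model.get(w, 0.0)
            score *= p if w in present else (1.0 - p)
        return score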
Document likelihood: model 2
Estimate a language model MQ for query Q
Rank documents by the likelihood of being a random sample from MQ
Issue: estimation of the query model
Treat the query as generated by a mixture of a topic model and a background model (sketched below)
Estimate a relevance model from related documents (query expansion)
Relevance feedback is easily incorporated
But: with different document lengths, likelihoods are not comparable
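A sketch of the mixture estimate and document-likelihood ranking, assuming a linear interpolation of topic and background models with a hypothetical weight lam:

    def query_model(w, topic_model, background_model, lam=0.5):
        # Mixture of topic and background; lam = 0.5 is an assumed weight.
        return lam * topic_model.get(w, 0.0) + (1 - lam) * background_model.get(w, 0.0)

    def document_likelihood(doc_tokens, topic_model, background_model):
        # Model 2: likelihood of the document being a random sample from MQ.
        # Longer documents multiply more factors, so raw likelihoods are
        # not comparable across lengths (the issue noted above).
        p = 1.0
        for w in doc_tokens:
            p *= query_model(w, topic_model, background_model)
        return p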
Likelihood ratio: model 2'
Using Bayes' rule: the likelihood that MQ is the source, given that we observed document D
But does not model document length
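In the document's notation, Bayes' rule gives the posterior that MQ generated D; one common form of the ranking ratio, assuming a background collection model as the alternative hypothesis, is:

    P(MQ|D) = P(D|MQ) P(MQ) / P(D)

    score(D) = P(D|MQ) / P(D|background)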
Language Modeling: pros and cons
pros
Formal mathematical model
Simple, well-understood framework
Integrates both indexing and retrieval models
Natural use of collection statistics, not heuristics
Avoids tricky issues of "relevance" and "aboutness"
cons
Difficult to incorporate notions of "relevance", user preferences
Relevance feedback / query expansion not straightforward
Can't accommodate phrases, passages, or Boolean operators
But there is recent LM work that addresses these issues
Language Modeling vs Vector Space
Similarities
Term weights based on frequency
Terms often used as if they were independent
Inverse document/collection frequency used
Some form of length normalization useful
Differences
Based on probability rather than similarity
Intuitions are probabilistic rather than geometric
Details of use of document length and term, document, and collection frequency differ
Summary
Both LMs and BIR provide theoretically sound retrieval models based on probabilities
Both explicitly model the uncertainty in understanding the information need and in representing documents and queries
But there are differences between LMs and the BIR model
Both LMs and extensions of the probabilistic retrieval model are among the state-of-the-art retrieval models