Improvements to BM25 and Language Models Examined

ANDREW TROTMAN, ANTTI PUURULA, BLAKE BURGESS
AUSTRALASIAN DOCUMENT COMPUTING SYMPOSIUM 2014
MELBOURNE, AUSTRALIA
PRESENTED BY ANTTI PUURULA
Introduction
TREC evaluations of the 1990s established the current ranking functions for ad hoc document retrieval
The mid-1990s introduced BM25 [23], the most successful ranking function to date
Armstrong et al. [1, 2] showed in 2009 that there was no evidence of improvement in a decade, but multiple recent publications claim improvements
Introduction
Has there been any improvement in ranking function precision?
We examine this question by testing several recent BM25 and LM ranking functions
We test each function on its own, then add relevance feedback, stemming, and stopping
Mean Average Precision (MAP) compared on INEX Wikipedia 2010 and TREC Adhoc 1-8, with functions optimized on Wikipedia 2009
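As a reminder of the metric, here is a minimal sketch of average precision and MAP, assuming binary relevance judgments. The function names are illustrative and are not taken from ATIRE or from any evaluation tool.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list against a set of relevant ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at this rank, counted only at relevant documents.
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """runs: iterable of (ranked_ids, relevant_ids) pairs, one per query."""
    runs = list(runs)
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

Relevant documents that are never retrieved still count in the denominator, so a truncated result list is penalized.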
Ranking functions: ATIRE BM25
Trotman et al. (2012) [27] : ATIRE version of BM25
Retrieval Status Value (RSV) for query q
Robertson-Walker IDF
◦ N = number of documents
◦ dft = number of documents in which term t occurs
BM25 term frequency normalization
◦ tftd = count of term t in document d
◦ Ld = length (L1-norm) of document d
◦ Lavg = average length of documents
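The formula itself did not survive on this slide. The sketch below assumes the commonly cited ATIRE form, in which the IDF is log(N / dft) and the term frequency component is (k1 + 1)·tftd / (k1·(1 − b + b·Ld/Lavg) + tftd). The function name, argument layout, and default parameter values are illustrative assumptions, not the ATIRE implementation.

```python
import math


def atire_bm25(query_terms, doc_tf, doc_len, avg_len, df, n_docs, k1=0.9, b=0.4):
    """Retrieval status value of one document for a query (sketch).

    doc_tf maps term -> count in the document (tftd);
    df maps term -> number of documents containing the term (dft).
    k1 and b are placeholder values, not the tuned ones from the study.
    """
    score = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf == 0 or df.get(t, 0) == 0:
            continue  # terms absent from the document contribute nothing
        idf = math.log(n_docs / df[t])                  # Robertson-Walker IDF
        norm = k1 * (1.0 - b + b * doc_len / avg_len)   # length normalization
        score += idf * ((k1 + 1.0) * tf) / (norm + tf)
    return score
```

With a single matching term, tf = 1, and a document of average length, the term frequency component equals 1 and the score reduces to the IDF alone.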
Ranking functions: BM25L
Lv & Zhai (2011) [12] : BM25 corrected for very long documents
Smoothed Robertson-Walker IDF
Length-corrected BM25 term frequency normalization
= BM25 with smoothed parameter estimates (with 1.0, 0.5, and δ added)
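A minimal sketch of BM25L under the description above, assuming the usual form: the IDF is log((N + 1) / (dft + 0.5)), the length-normalized count is ctd = tftd / (1 − b + b·Ld/Lavg), and δ is added to ctd in both numerator and denominator. Names and default values are illustrative assumptions.

```python
import math


def bm25l(query_terms, doc_tf, doc_len, avg_len, df, n_docs,
          k1=1.2, b=0.75, delta=0.5):
    """BM25L score of one document for a query (sketch, placeholder defaults)."""
    score = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf == 0:
            continue  # delta is only applied to terms that occur in the document
        idf = math.log((n_docs + 1.0) / (df[t] + 0.5))   # smoothed IDF
        ctd = tf / (1.0 - b + b * doc_len / avg_len)      # length-normalized tf
        score += idf * ((k1 + 1.0) * (ctd + delta)) / (k1 + ctd + delta)
    return score
```

Because δ shifts ctd upward, the term frequency component of a very long document cannot fall all the way to zero, which is the correction the slide refers to.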
Ranking functions: BM25+
Lv & Zhai (2011) [11]: BM25 with lower-bounded term weights
Smoothed Robertson-Walker IDF
Lower-bounding parameter
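A minimal sketch of BM25+, assuming the smoothed IDF takes the form log((N + 1) / dft) and that the lower-bounding parameter δ is added to the standard BM25 term frequency component. Names and defaults are illustrative assumptions, not the tuned settings.

```python
import math


def bm25_plus(query_terms, doc_tf, doc_len, avg_len, df, n_docs,
              k1=1.2, b=0.75, delta=1.0):
    """BM25+ score of one document for a query (sketch, placeholder defaults)."""
    score = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf == 0:
            continue  # the lower bound applies only to matched terms
        idf = math.log((n_docs + 1.0) / df[t])            # smoothed IDF
        norm = k1 * (1.0 - b + b * doc_len / avg_len)
        tf_part = ((k1 + 1.0) * tf) / (norm + tf)
        score += idf * (tf_part + delta)                  # lower-bounded weight
    return score
```

Every matched term contributes at least idf·δ, however long the document is.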
Ranking functions: BM25-adpt
Lv & Zhai (2011) [10]: BM25 with term-dependent k1, using Information Gain Gqr
Term-dependent component
Smoothed Robertson-Walker IDF
k’1 solved offline for each term from the index, using a curve-fitting technique and the least squares method
Ranking functions: BM25T
Lv & Zhai (2012) [13]: BM25 with term-dependent k1, using log-logistic method
k’1 solved offline for each term from the index, using the Newton-Raphson method
Ranking functions: TFl∘δ∘p×IDF
Rousseau & Vazirgiannis (2013) [25]: Composite non-linear TF normalizations
Smoothed Robertson-Walker IDF
Log-concavity normalization
BM25 soft length normalization
Lower-bounding parameter
Ranking functions: LM-DS
Zhai & Lafferty (2001): Unigram Language Model with Dirichlet Prior Smoothing
Smoothing component
Matched term component
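The two labelled components can be sketched as follows, assuming the standard rank-equivalent form of the Dirichlet-smoothed query likelihood: a smoothing component |q|·log(μ / (Ld + μ)) plus, for each matched term, log(1 + tftd / (μ·p(t|C))). The document-independent sum of log p(t|C) is dropped. Names and the default μ are illustrative assumptions.

```python
import math


def lm_dirichlet(query_terms, doc_tf, doc_len, coll_tf, coll_len, mu=2500.0):
    """Rank-equivalent Dirichlet-smoothed query log-likelihood (sketch).

    coll_tf maps term -> count in the whole collection; coll_len is the
    total number of term occurrences in the collection.
    """
    # Smoothing component: depends only on document length and query length.
    smoothing = len(query_terms) * math.log(mu / (doc_len + mu))
    # Matched term component: only terms that occur in the document.
    matched = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf > 0:
            p_coll = coll_tf[t] / coll_len
            matched += math.log(1.0 + tf / (mu * p_coll))
    return smoothing + matched
```

Only matched terms need to be visited at query time, which is why this decomposition suits an inverted index.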
Ranking functions: LM-PYP
Momtazi & Klakow (2010): Unigram LM with Pitman-Yor Process smoothing
Power-law discounting
Ranking functions: LM-PYP-TFIDF
Puurula (2012): LM-PYP with TFIDF feature weighting
TF-IDF feature weighting
ATIRE KL-divergence feedback
Rank terms in the top k retrieved documents Ri using KL-divergence between the top-k document model and the collection model
Expand the query with the top n ranked terms using Rocchio feedback, combining the original query vector with the feedback query vector
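The exact ATIRE formulas are not recoverable from the slide, so the following is only a generic sketch of the two steps. It assumes each term is scored by its contribution to the KL-divergence between a pooled top-k document model and the collection model, and that expansion is a Rocchio-style weighted sum. All names and the weights alpha and beta are hypothetical.

```python
import math
from collections import Counter


def kl_term_scores(top_docs, coll_tf, coll_len):
    """Score terms by p(t|top-k) * log(p(t|top-k) / p(t|collection)).

    top_docs is a list of token lists for the top k retrieved documents.
    """
    pooled = Counter()
    for doc in top_docs:
        pooled.update(doc)
    total = sum(pooled.values())
    scores = {}
    for t, tf in pooled.items():
        p_top = tf / total
        p_coll = coll_tf[t] / coll_len
        scores[t] = p_top * math.log(p_top / p_coll)
    return scores


def expand_query(query_weights, term_scores, n_terms, alpha=1.0, beta=0.5):
    """Rocchio-style expansion: alpha * original vector + beta * feedback vector."""
    top = sorted(term_scores.items(), key=lambda kv: kv[1], reverse=True)[:n_terms]
    expanded = {t: alpha * w for t, w in query_weights.items()}
    for t, s in top:
        expanded[t] = expanded.get(t, 0.0) + beta * s
    return expanded
```

Terms that are frequent in the top-k documents but rare in the collection score highest; terms as common as stopwords score near or below zero.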
Truncated model-based feedback
Reweight the original query terms using the top-k documents, with the posterior probabilities of the documents as mixture weights
Interpolate the feedback query vector with the original query vector
Parameter optimization
Parameters for each ranking function optimized on INEX Wikipedia 2009
◦ Parameters constrained to reasonable ranges
◦ Particle Swarm Optimization with 64 particles and 20 generations
◦ 50 generations used for models with feedback (with up to 8 parameters)
Functions tested on INEX Wikipedia 2010 and TREC 1-8 datasets
◦ INEX Wikipedia 2010: same documents as INEX 2009, different queries
◦ TREC 1-8: different documents, different queries
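The optimization loop can be sketched as a basic particle swarm optimizer with positions clamped to the parameter ranges. This is a generic textbook variant, not the optimizer used in the study. The inertia and acceleration coefficients are assumed values, and the objective would in practice be MAP on the tuning collection.

```python
import random


def pso_maximize(objective, bounds, n_particles=64, n_generations=20,
                 inertia=0.7, c_personal=1.5, c_global=1.5, seed=0):
    """Maximize objective(params) over box-constrained parameters (sketch)."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    best_pos = [p[:] for p in pos]              # personal best positions
    best_val = [objective(p) for p in pos]      # personal best values
    g = max(range(n_particles), key=lambda i: best_val[i])
    g_pos, g_val = best_pos[g][:], best_val[g]  # global best so far
    for _ in range(n_generations):
        for i in range(n_particles):
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (inertia * vel[i][d]
                             + c_personal * r1 * (best_pos[i][d] - pos[i][d])
                             + c_global * r2 * (g_pos[d] - pos[i][d]))
                # Clamp to keep the parameter inside its allowed range.
                pos[i][d] = min(hi, max(lo, pos[i][d] + vel[i][d]))
            val = objective(pos[i])
            if val > best_val[i]:
                best_pos[i], best_val[i] = pos[i][:], val
                if val > g_val:
                    g_pos, g_val = pos[i][:], val
    return g_pos, g_val
```

A call such as `pso_maximize(map_on_tuning_set, [(0.0, 3.0), (0.0, 1.0)])` would search k1 and b, where `map_on_tuning_set` is a hypothetical function returning MAP for a parameter vector. Each generation costs one full evaluation per particle, so 64 particles over 20 generations means 1,280 retrieval runs after initialization.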
First observations
Same Documents, Different Queries (INEX 2010):
◦ Differences between ranking functions very small
Different Documents, Different Queries (TREC 1-8):
◦ BM25-adpt slightly better than others on 5 out of 9 collections
◦ Most likely due to the collection-adaptive k1-parameters
◦ LMs generally worse than BM25 variants
◦ But ATIRE LM implementations not extensively optimized, unlike BM25
More observations
Feedback is very effective for both BM25 and LM
◦ ATIRE KL-feedback fails on LMs, truncated model-based feedback works
Stopping harms BM25 strongly, stemming can help
◦ Porter-stemming seems to harm
◦ S-stemmer and Krovetz help
Final observations: feedback+stemming
Feedback+stemming improves BM25 and LM+DP
◦ No ranking function clearly better than rest
◦ Stemming is effective
◦ Again ATIRE KL-feedback fails on LMs, truncated model-based feedback works
Paired 1-tailed t-tests of best-performing functions:
◦ Feedback is better than no feedback (p=0.0267)
◦ Stemming with feedback is better than just feedback (p=0.0292)
◦ Stemming with feedback is better than neither (p<0.0001)
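For reference, the statistic behind a paired t-test can be sketched with the standard library alone. The slide does not say what the paired observations are, so the inputs here are simply two equal-length lists of scores. The one-tailed p-value would then be read from a t distribution with the returned degrees of freedom.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for the paired differences a - b."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)          # sample standard deviation
    t = mean_diff / (sd_diff / math.sqrt(n))   # standard error in the denominator
    return t, n - 1
```

A positive t supports the one-sided hypothesis that the first system scores higher than the second.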
Conclusions
Differences between the suggested BM25 ranking functions become very small when parameters are optimized on a different but similar dataset
◦ LM power-law discounting particularly brittle, BM25 parameters more stable
Feedback works for both BM25 and LM, but different feedback functions needed
Stopping harms BM25, stemming can help
Results were exploratory, but in this scenario BM25 seems to outperform LM
◦ Implementation differences can reduce ranking function performance
◦ Optimization becomes increasingly difficult with many parameters
Rewriting BM25 (BM25L example)
Robertson & Sparck-Jones 1976
