MSMarco Chameleons: Challenging the MSMarco Leaderboard with Extremely Obstinate Queries

In this work, we find that while rankers have seen impressive performance improvements in recent years, there is still a significant number of queries that cannot be addressed by any of the state-of-the-art neural rankers. We refer to these queries as obstinate queries because of their difficulty. Regardless of the neural ranker, these queries see no performance improvement, and the gains in overall performance reported by each ranker come from improvements on other subsets of queries. We believe that careful treatment of these queries will lead to more stable and consistent performance of neural rankers across all queries.

More details can be found in the paper: MSMarco Chameleons: Challenging the MSMarco Leaderboard with Extremely Obstinate Queries (CIKM 2021)

We investigate the performance of SOTA rankers on the MS MARCO small dev set, which contains 6,980 queries. We noticed that no matter which baseline method is considered, whether a traditional BM25 ranker or a complex neural ranker, there is a noticeable number of queries for which the rankers are unable to return any reasonable ranking. Further, a noticeable number of these poorly performing queries are shared across all the rankers. Table 1 reports the performance on the 'difficult' queries, i.e., those that fall within the bottom 50% of per-query performance for each baseline and are shared by 4, 5, and 6 of the SOTA rankers (a sketch of this selection procedure follows the table).

Table 1. MAP performance of the rankers on the 50% hardest queries of the Chameleon datasets.

| Variations | Dataset Name | Number of Queries | BM25 | DeepCT | DocT5Query | RepBert | ANCE | TCT-ColBert |
|---|---|---|---|---|---|---|---|---|
| Common in 6 rankers | Lesser Chameleon | 1693 | 0.0066 | 0.0122 | 0.0185 | 0.0212 | 0.0286 | 0.0267 |
| Common in 5 rankers | Pygmy Chameleon | 2473 | 0.0215 | 0.0240 | 0.0403 | 0.0398 | 0.0546 | 0.0462 |
| Common in 4 rankers | Veiled Chameleon | 3119 | 0.0392 | 0.0400 | 0.0660 | 0.0560 | 0.0847 | 0.0780 |
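
As a rough illustration, the following is a minimal sketch of the hard-query selection using pytrec_eval over TREC-format run files. The file names are placeholders, and the "hard for at least k rankers" intersection criterion is an assumption made for illustration rather than the exact script used in the paper.

```python
# Minimal sketch: per-query MAP per ranker, bottom 50% per ranker, intersection.
# File names are placeholders; the at-least-k criterion is an assumption.
import pytrec_eval
from collections import Counter

def load_qrels(path):
    qrels = {}
    with open(path) as f:
        for line in f:
            qid, _, docid, rel = line.split()
            qrels.setdefault(qid, {})[docid] = int(rel)
    return qrels

def load_run(path):
    run = {}
    with open(path) as f:
        for line in f:
            qid, _, docid, _, score, _ = line.split()
            run.setdefault(qid, {})[docid] = float(score)
    return run

qrels = load_qrels('qrels.dev.small.txt')  # placeholder path
evaluator = pytrec_eval.RelevanceEvaluator(qrels, {'map'})

# For each ranker, collect the queries in its bottom 50% by per-query MAP.
hard_sets = []
for run_path in ['bm25.run', 'deepct.run', 'doct5query.run',
                 'repbert.run', 'ance.run', 'tct_colbert.run']:  # placeholder names
    per_query = evaluator.evaluate(load_run(run_path))
    ranked = sorted(per_query, key=lambda q: per_query[q]['map'])
    hard_sets.append(set(ranked[:len(ranked) // 2]))

# Queries that are hard for at least k rankers; k = 6, 5, 4 would correspond to
# the Lesser, Pygmy, and Veiled Chameleon subsets, respectively.
def common_hard(hard_sets, k):
    counts = Counter(q for s in hard_sets for q in s)
    return {q for q, c in counts.items() if c >= k}

lesser_chameleon = common_hard(hard_sets, 6)
```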

We made all the runs available in the Chameleons Google drive.

Baseline Rankers Implementation

You can find the implementation details of each method here.
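
For reference only, below is a minimal sketch of running a BM25 baseline over MS MARCO passages with Pyserini; the prebuilt index name and the BM25 parameters are assumptions and may differ from the exact setup behind the reported runs.

```python
# Minimal sketch, assuming Pyserini with a prebuilt MS MARCO passage index.
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher.from_prebuilt_index('msmarco-v1-passage')  # assumed index name
searcher.set_bm25(k1=0.82, b=0.68)  # commonly used MS MARCO-tuned parameters

hits = searcher.search('what is the largest chameleon species', k=1000)
for rank, hit in enumerate(hits[:10], start=1):
    print(rank, hit.docid, round(hit.score, 4))
```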

Query Reformulation

Furthermore, since the literature reports that hard queries often stem from issues such as vocabulary mismatch, and hence can potentially be improved through query reformulation, we report the performance of several strong query reformulation techniques on the MSMarco Chameleons dataset. We show that such queries remain stubborn and do not see noticeable performance improvements even after systematic reformulation.

The expanded queries, which were generated using the ReQue toolkit, can be found here.

Table 2. MAP performance of query reformulation methods on the 50% hardest (obstinate) queries.
| Category | Method | BM25 | DeepCT | DocT5 | RepBert | ANCE | TCT-ColBert |
|---|---|---|---|---|---|---|---|
| Pseudo-Relevance Feedback | Relevance feedback | 0.0477 | 0.0574 | 0.0566 | 0.0513 | 0.0277 | 0.0693 |
| Pseudo-Relevance Feedback | RM3 | 0.0407 | 0.0375 | 0.0603 | 0.0459 | 0.0374 | 0.0610 |
| Pseudo-Relevance Feedback | Document clustering | 0.0392 | 0.0393 | 0.0593 | 0.0550 | 0.0609 | 0.0765 |
| Pseudo-Relevance Feedback | Term clustering | 0.0412 | 0.0424 | 0.0567 | 0.0557 | 0.0693 | 0.0724 |
| External Sources | Neural embeddings | 0.0218 | 0.0248 | 0.0285 | 0.0409 | 0.0468 | 0.0462 |
| External Sources | Wikipedia | 0.0277 | 0.0313 | 0.0341 | 0.0368 | 0.0466 | 0.0396 |
| External Sources | Thesaurus | 0.0277 | 0.0313 | 0.0341 | 0.0368 | 0.0466 | 0.0396 |
| External Sources | Entity linking | 0.0399 | 0.0450 | 0.0543 | 0.0507 | 0.0533 | 0.0649 |
| External Sources | Sense disambiguation | 0.0359 | 0.0360 | 0.0521 | 0.0512 | 0.0653 | 0.0633 |
| External Sources | ConceptNet | 0.0269 | 0.0278 | 0.0342 | 0.0369 | 0.0488 | 0.0442 |
| External Sources | WordNet | 0.0271 | 0.0569 | 0.0346 | 0.0359 | 0.0399 | 0.0406 |
| Supervised Approaches | ANMT (Seq2Seq) | 0.0002 | 0.0007 | 0.0010 | 0.0020 | 0.0046 | 0.0066 |
| Supervised Approaches | ACG (Seq2Seq + Attention) | 0.0240 | 0.0307 | 0.0359 | 0.0433 | 0.0450 | 0.0470 |
| Supervised Approaches | HRED-qs | 0.0060 | 0.0020 | 0.0030 | 0.0060 | 0.0082 | 0.0110 |

It should be noted that the pseudo-relevance feedback-based query expansion methods produce different expansions for each run, since they depend on the first round of retrieval. For the other methods, the expanded queries are the same across all rankers.
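
To illustrate this dependence on first-round retrieval, here is a minimal sketch of a pseudo-relevance feedback (RM3) run with Pyserini; this is an assumed setup for demonstration, not the ReQue configuration used to produce the numbers above.

```python
# Minimal sketch of an RM3 pseudo-relevance feedback run, assuming Pyserini
# with a prebuilt MS MARCO passage index (not the ReQue configuration).
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher.from_prebuilt_index('msmarco-v1-passage')  # assumed index name
searcher.set_bm25(k1=0.82, b=0.68)
# RM3 re-weights and expands each query with terms drawn from the top documents
# of an initial retrieval pass, so the expansions differ from ranker to ranker.
searcher.set_rm3(fb_terms=10, fb_docs=10, original_query_weight=0.5)

hits = searcher.search('what is the largest chameleon species', k=1000)
```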

Please cite our work as:

@inproceedings{arabzadehcikm2021-3,
  author    = {Negar Arabzadeh and Bhaskar Mitra and Ebrahim Bagheri},
  title     = {MSMarco Chameleons: Challenging the MSMarco Leaderboard with Extremely Obstinate Queries},
  booktitle = {The 30th ACM Conference on Information and Knowledge Management (CIKM 2021)},
  year      = {2021}
}

Authors

Negar Arabzadeh, Bhaskar Mitra and Ebrahim Bagheri

Laboratory for Systems, Software and Semantics (LS3), Ryerson University, ON, Canada.