MSMarco Chameleons: Challenging the MSMarco Leaderboard with Extremely Obstinate Queries

In this work, we find that while rankers have seen impressive performance improvements in recent years, there is still a significant number of queries that cannot be addressed by any of the state-of-the-art neural rankers. We refer to these queries as obstinate queries because of their difficulty. Regardless of the neural ranker, these queries see no performance improvement, and the gains in overall performance reported by each ranker come from improvements on other subsets of queries. We believe that careful treatment of these queries will lead to more stable and consistent performance of neural rankers across all queries.

More details can be found in the paper: MSMarco Chameleons: Challenging the MSMarco Leaderboard with Extremely Obstinate Queries (CIKM 2021)

We investigate the performance of SOTA rankers on the MS MARCO small dev set, which contains 6,980 queries. We noticed that no matter which baseline method is considered, whether a traditional BM25 ranker or a complex neural ranker, there is a noticeable number of queries for which the rankers are unable to return any reasonable ranking. Further, a noticeable number of these poorly performing queries are shared across all the rankers. Table 1 reports the performance on the 'difficult' queries, i.e., those that fall within the bottom 50% of per-query performance for each baseline and are shared by 4, 5, and 6 of the SOTA rankers (a sketch of this selection procedure follows the table).

Table 1. MAP performance of the rankers on the 50% hardest queries of the Chameleon datasets.

| Variations | Dataset Name | Number of Queries | BM25 | DeepCT | DocT5Query | RepBert | ANCE | TCT-ColBert |
|---|---|---|---|---|---|---|---|---|
| Common in 6 rankers | Lesser Chameleon | 1693 | 0.0066 | 0.0122 | 0.0185 | 0.0212 | 0.0286 | 0.0267 |
| Common in 5 rankers | Pygmy Chameleon | 2473 | 0.0215 | 0.0240 | 0.0403 | 0.0398 | 0.0546 | 0.0462 |
| Common in 4 rankers | Veiled Chameleon | 3119 | 0.0392 | 0.0400 | 0.0660 | 0.0560 | 0.0847 | 0.0780 |
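
As a rough illustration, the following is a minimal sketch of the hard-query selection using pytrec_eval over TREC-format run files. The file names are placeholders, and the "hard for at least k rankers" intersection criterion is an assumption made for illustration rather than the exact script used in the paper.

```python
# Minimal sketch: per-query MAP per ranker, bottom 50% per ranker, intersection.
# File names are placeholders; the at-least-k criterion is an assumption.
import pytrec_eval
from collections import Counter

def load_qrels(path):
    qrels = {}
    with open(path) as f:
        for line in f:
            qid, _, docid, rel = line.split()
            qrels.setdefault(qid, {})[docid] = int(rel)
    return qrels

def load_run(path):
    run = {}
    with open(path) as f:
        for line in f:
            qid, _, docid, _, score, _ = line.split()
            run.setdefault(qid, {})[docid] = float(score)
    return run

qrels = load_qrels('qrels.dev.small.txt')  # placeholder path
evaluator = pytrec_eval.RelevanceEvaluator(qrels, {'map'})

# For each ranker, collect the queries in its bottom 50% by per-query MAP.
hard_sets = []
for run_path in ['bm25.run', 'deepct.run', 'doct5query.run',
                 'repbert.run', 'ance.run', 'tct_colbert.run']:  # placeholder names
    per_query = evaluator.evaluate(load_run(run_path))
    ranked = sorted(per_query, key=lambda q: per_query[q]['map'])
    hard_sets.append(set(ranked[:len(ranked) // 2]))

# Queries that are hard for at least k rankers; k = 6, 5, 4 would correspond to
# the Lesser, Pygmy, and Veiled Chameleon subsets, respectively.
def common_hard(hard_sets, k):
    counts = Counter(q for s in hard_sets for q in s)
    return {q for q, c in counts.items() if c >= k}

lesser_chameleon = common_hard(hard_sets, 6)
```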

We made all the runs available in the Chameleons Google drive.

Baseline Rankers Implementation

You can find the implementation details of each method here.
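
For reference only, below is a minimal sketch of running a BM25 baseline over MS MARCO passages with Pyserini; the prebuilt index name and the BM25 parameters are assumptions and may differ from the exact setup behind the reported runs.

```python
# Minimal sketch, assuming Pyserini with a prebuilt MS MARCO passage index.
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher.from_prebuilt_index('msmarco-v1-passage')  # assumed index name
searcher.set_bm25(k1=0.82, b=0.68)  # commonly used MS MARCO-tuned parameters

hits = searcher.search('what is the largest chameleon species', k=1000)
for rank, hit in enumerate(hits[:10], start=1):
    print(rank, hit.docid, round(hit.score, 4))
```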

Query Reformulation

Furthermore, since the literature reports that hard queries often stem from issues such as vocabulary mismatch, and hence can potentially be improved through query reformulation, we report the performance of several strong query reformulation techniques on the MSMarco Chameleons dataset. We show that such queries remain stubborn and do not see noticeable performance improvements even after systematic reformulation.

The expanded queries, which were generated using the ReQue toolkit, can be found here.

Table 2. MAP performance of query reformulation methods on the 50% hardest (obstinate) queries.
| Category | Method | BM25 | DeepCT | DocT5 | RepBert | ANCE | TCT-ColBert |
|---|---|---|---|---|---|---|---|
| Pseudo-Relevance Feedback | Relevance feedback | 0.0477 | 0.0574 | 0.0566 | 0.0513 | 0.0277 | 0.0693 |
| Pseudo-Relevance Feedback | RM3 | 0.0407 | 0.0375 | 0.0603 | 0.0459 | 0.0374 | 0.0610 |
| Pseudo-Relevance Feedback | Document clustering | 0.0392 | 0.0393 | 0.0593 | 0.0550 | 0.0609 | 0.0765 |
| Pseudo-Relevance Feedback | Term clustering | 0.0412 | 0.0424 | 0.0567 | 0.0557 | 0.0693 | 0.0724 |
| External Sources | Neural embeddings | 0.0218 | 0.0248 | 0.0285 | 0.0409 | 0.0468 | 0.0462 |
| External Sources | Wikipedia | 0.0277 | 0.0313 | 0.0341 | 0.0368 | 0.0466 | 0.0396 |
| External Sources | Thesaurus | 0.0277 | 0.0313 | 0.0341 | 0.0368 | 0.0466 | 0.0396 |
| External Sources | Entity linking | 0.0399 | 0.0450 | 0.0543 | 0.0507 | 0.0533 | 0.0649 |
| External Sources | Sense disambiguation | 0.0359 | 0.0360 | 0.0521 | 0.0512 | 0.0653 | 0.0633 |
| External Sources | ConceptNet | 0.0269 | 0.0278 | 0.0342 | 0.0369 | 0.0488 | 0.0442 |
| External Sources | WordNet | 0.0271 | 0.0569 | 0.0346 | 0.0359 | 0.0399 | 0.0406 |
| Supervised Approaches | ANMT (Seq2Seq) | 0.0002 | 0.0007 | 0.0010 | 0.0020 | 0.0046 | 0.0066 |
| Supervised Approaches | ACG (Seq2Seq + Attention) | 0.0240 | 0.0307 | 0.0359 | 0.0433 | 0.0450 | 0.0470 |
| Supervised Approaches | HRED-qs | 0.0060 | 0.0020 | 0.0030 | 0.0060 | 0.0082 | 0.0110 |

It should be noted that the pseudo-relevance feedback-based query expansion methods produce different expansions for each run, since they depend on the first round of retrieval. For the other methods, the expanded queries are the same across all rankers.
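
To illustrate this dependence on first-round retrieval, here is a minimal sketch of a pseudo-relevance feedback (RM3) run with Pyserini; this is an assumed setup for demonstration, not the ReQue configuration used to produce the numbers above.

```python
# Minimal sketch of an RM3 pseudo-relevance feedback run, assuming Pyserini
# with a prebuilt MS MARCO passage index (not the ReQue configuration).
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher.from_prebuilt_index('msmarco-v1-passage')  # assumed index name
searcher.set_bm25(k1=0.82, b=0.68)
# RM3 re-weights and expands each query with terms drawn from the top documents
# of an initial retrieval pass, so the expansions differ from ranker to ranker.
searcher.set_rm3(fb_terms=10, fb_docs=10, original_query_weight=0.5)

hits = searcher.search('what is the largest chameleon species', k=1000)
```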

Please cite our work as:

@inproceedings{arabzadehcikm2021-3,
  author    = {Negar Arabzadeh and Bhaskar Mitra and Ebrahim Bagheri},
  title     = {MSMarco Chameleons: Challenging the MSMarco Leaderboard with Extremely Obstinate Queries},
  booktitle = {The 30th ACM Conference on Information and Knowledge Management (CIKM 2021)},
  year      = {2021}
}

Authors

Negar Arabzadeh, Bhaskar Mitra and Ebrahim Bagheri

Laboratory for Systems, Software and Semantics (LS3), Ryerson University, ON, Canada.