spacemanidol/MSMARCO

Uncommon train / dev / test split of ranking dataset

drennings opened this issue · 5 comments

Hi,

I have two questions about the train/dev/test split of the ranking dataset. I noted that:

  • The train set queries.train consists of 502,939 questions, of which all 502,939 have at least 1 answer in qrels.train.
  • The dev set queries.dev consists of 12,665 questions, of which only 6,986 have at least 1 answer in qrels.dev.
  • The test set queries.eval consists of 12,560 questions.
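For reference, counts like these can be reproduced by joining the query ids in queries.* against the first column of qrels.*. Here is a minimal Python sketch; it assumes both files are tab-separated with the query id in the first column, which is my reading of the files rather than something documented in this thread:

    def load_qids_with_answers(qrels_path):
        """Collect the set of query ids that have at least one relevant passage."""
        qids = set()
        with open(qrels_path, encoding="utf-8") as f:
            for line in f:
                # qrels lines are assumed to be tab-separated: qid, 0, pid, relevance
                qids.add(line.split("\t")[0])
        return qids

    def count_coverage(queries_path, qrels_path):
        answered = load_qids_with_answers(qrels_path)
        total = with_answer = 0
        with open(queries_path, encoding="utf-8") as f:
            for line in f:
                # queries lines are assumed to be tab-separated: qid, query text
                qid = line.split("\t")[0]
                total += 1
                with_answer += qid in answered
        return total, with_answer

    if __name__ == "__main__":
        for split in ("train", "dev"):
            total, with_answer = count_coverage(f"queries.{split}.tsv", f"qrels.{split}.tsv")
            print(f"{split}: {with_answer}/{total} queries have at least one qrels entry")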

Now, my questions are:

  1. Why was roughly a 40:1:1 split made instead of e.g. a more common 8:1:1 split?
  2. Why do only (roughly) 55% of the queries in dev have an answer whereas 100% of the queries in train have an answer?

Thanks in advance!

Hey.

Good eye. It seems I uploaded the wrong files, but I have fixed that. We had initially subsampled the files while experimenting since the set is so big.

New queries:
  101093 queries.dev.tsv
  101092 queries.eval.tsv
  502939 queries.train.tsv

New qrels:
   45684 qrels.dev.tsv
  401023 qrels.train.tsv

The sizes are going to be a little different: for the train set we removed all queries that do not have an answer (the original train set is ~800,000 queries), but we have not removed these from dev and eval in order to keep those sets held out and avoid affecting the other MSMARCO tasks.

That said, the percentage of queries that do not have answers is about the same across splits (~35%), so the sets are now matched.
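For anyone who wants to reproduce that filtering step locally (keeping only the train queries that have an entry in qrels.train), here is a minimal Python sketch. It again assumes tab-separated files with the query id in the first column, and the output filename is just a placeholder:

    # Keep only the train queries that appear in qrels.train (i.e. have an answer).
    with open("qrels.train.tsv", encoding="utf-8") as f:
        answered = {line.split("\t")[0] for line in f}

    with open("queries.train.tsv", encoding="utf-8") as fin, \
         open("queries.train.answered.tsv", "w", encoding="utf-8") as fout:
        for line in fin:
            if line.split("\t")[0] in answered:
                fout.write(line)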

Great, thanks for the fix and clarification!

Let me double check if I get things right:

  • It seems that queries.dev, queries.eval, qrels.train and qrels.dev have been updated, whereas collection and queries.train have not.
  • Since the queries in train and the passages in the collection have stayed the same while the number of answers for the train split has been lowered (from 532,761 to 401,023 entries in qrels.train), there is now a set of questions in queries.train that have no answer in qrels.train but do have an answer passage present in the collection (presuming that the original qrels.train contained 532,761 correct entries).

If these statements are correct, I wonder what the use is of the questions in queries.train that actually have an answer in collection.tsv that is not mentioned in qrels.train. If each split should contain queries for which no answer exists, shouldn't queries.train also be updated? For instance:

  • Instead of containing some queries for which no answer is present in qrels.train, shouldn't it contain queries for which there actually is no answer present in the dataset? (This would be the desired update from my perspective as a user.)
  • Or, if you prefer not to provide such additional queries for which no answer is present in the dataset, then shouldn't the queries that have an answer in collection.tsv which is not mentioned in qrels.train at least be removed from queries.train?

Hey,

Sorry for closing this early.

You are correct, those files were updated. It turns out the original queries.dev and queries.eval were just subsamples of the actual query sets, so I uploaded the full files. The qrels.train became smaller because there was some normalization applied to collection.tsv, and the person who did that is on vacation. Once I fix this normalization error I will update the collection.tsv file and the qrels files. The expected sizes should be ~550k for train and ~56k for dev (and about the same for eval).

It is worth noting that the queries.* files exist only to make joining the sets easier; they are not used in evaluation. For evaluation, your system will be reranking passages for a query where an answer exists. Your system's score is based on how highly it is able to rank the relevant passages (qrels). Since there are a few cases where the BM25 model did not return the passage marked as relevant (few, but they happen), a system will never be able to achieve a perfect MRR of 1. I will post the theoretical maximum MRR for this dataset shortly.
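To make the scoring concrete, here is a minimal MRR@k sketch in Python. The run-file layout (tab-separated qid, pid, rank) and the file paths are assumptions for illustration, not the official evaluation script. Queries whose relevant passage never appears among the reranked candidates contribute 0, which is exactly why the theoretical maximum sits below 1:

    from collections import defaultdict

    def load_qrels(path):
        relevant = defaultdict(set)
        with open(path, encoding="utf-8") as f:
            for line in f:
                # qrels lines are assumed to be tab-separated: qid, 0, pid, relevance
                cols = line.split("\t")
                relevant[cols[0]].add(cols[2])
        return relevant

    def mrr_at_k(run_path, qrels_path, k=10):
        relevant = load_qrels(qrels_path)
        ranked = defaultdict(list)
        with open(run_path, encoding="utf-8") as f:
            for line in f:
                # run lines are assumed to be tab-separated: qid, pid, rank
                qid, pid, rank = line.strip().split("\t")[:3]
                ranked[qid].append((int(rank), pid))
        total = 0.0
        for qid, rels in relevant.items():
            hits = sorted(ranked.get(qid, []))[:k]
            # Reciprocal rank of the first relevant passage; 0 if none was retrieved.
            total += next((1.0 / rank for rank, pid in hits if pid in rels), 0.0)
        return total / len(relevant)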

Hey,

No problem at all, thanks for your reply.

By "normalization error", do you mean that there are now duplicate passages in collection.tsv (passages that have a different id but the same contents)?
And that these duplicate documents will be removed from collection.tsv, so that the file will only contain unique passages, and that all qrels files will be updated accordingly?

Looking forward to the updated dataset!

No, by normalization I mean that some characters were removed in collection.tsv that weren't removed elsewhere; the ids are constant and the size is the same.

If you go ahead and check the updated qrels, you will now find the full files:

erasmus@spacemanidol:~/MSMARCOV2/Ranking/Baselines/DataDir$ wc -l qrels.*
   59273 qrels.dev.tsv
   59187 qrels.eval.tsv
  532761 qrels.train.tsv
  651221 total
There may be an update to the dataset in the future, but for now feel free to have at it!