[Data] M0 aligned document count is much lower than the original CNN/DM document count

Question

La-SilverLand opened this issue 4 years ago · 3 comments

The original CNN/DM dataset contains 92,579 + 219,506 = 312,085 documents in total.
However, the data for M0 (lead-3) has only 11,490 aligned documents, with 38 duplicated IDs, which is far fewer than the original data.
Can you explain why?

Answer 1 · 2020-08-07T12:46:14.000Z

Hi @La-SilverLand!

The model outputs correspond to the test split, consisting of 11490 examples, and not the entire dataset.
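
As a rough sanity check on the numbers, here is a minimal sketch. The train and validation sizes below are the commonly reported figures for the standard CNN/DM split; they are not stated in this thread, so treat them as an assumption. The names `SPLIT_SIZES` and `split_fraction` are made up for illustration.

```python
# Commonly reported sizes of the standard CNN/DM split.
# Assumption: only the test size (11490) is confirmed in this thread.
SPLIT_SIZES = {"train": 287_227, "validation": 13_368, "test": 11_490}

# Total quoted in the question: 92579 + 219506 documents.
TOTAL_FROM_QUESTION = 92_579 + 219_506


def split_fraction(split: str) -> float:
    """Return the share of the full dataset that a given split covers."""
    return SPLIT_SIZES[split] / sum(SPLIT_SIZES.values())


if __name__ == "__main__":
    print("sum of splits:", sum(SPLIT_SIZES.values()))
    print("total from question:", TOTAL_FROM_QUESTION)
    print(f"test share: {split_fraction('test'):.2%}")
```

Under these assumed sizes, the three splits add up to the total quoted in the question, and the test split is only a small fraction of it.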

Answer 2 · 2020-08-10T03:28:20.000Z

Does this test split use the same URL list as https://github.com/abisee/cnn-dailymail/tree/master/url_lists/all_test.txt?
Or did you randomly sample the test cases to get 11,490 in total?

Answer 3 · 2020-08-10T11:46:50.000Z

Hi @La-SilverLand!

This is the standard test set, so yes, it uses the URLs in that list.
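
If you want to check this yourself, here is a minimal sketch of how the IDs could be compared against the URL list and scanned for duplicates. It assumes that story IDs are the SHA-1 hex digest of each URL, which is how the abisee/cnn-dailymail preprocessing names story files; whether the IDs in the aligned data follow exactly that form (for example, without an extra prefix) is an assumption you would need to verify. The function names are hypothetical.

```python
import hashlib
from collections import Counter
from typing import Dict, Iterable, Set


def url_to_story_id(url: str) -> str:
    """Hash a URL into a story ID (assumed scheme: SHA-1 hex digest)."""
    return hashlib.sha1(url.strip().encode("utf-8")).hexdigest()


def find_duplicates(ids: Iterable[str]) -> Dict[str, int]:
    """Return every ID that occurs more than once, with its count."""
    counts = Counter(ids)
    return {doc_id: n for doc_id, n in counts.items() if n > 1}


def compare_with_url_list(
    output_ids: Iterable[str], urls: Iterable[str]
) -> Dict[str, Set[str]]:
    """Compare model-output IDs with the IDs derived from a URL list."""
    expected = {url_to_story_id(u) for u in urls if u.strip()}
    seen = set(output_ids)
    return {
        "missing": expected - seen,      # in the URL list, not in the outputs
        "unexpected": seen - expected,   # in the outputs, not in the URL list
    }
```

To use it, read the lines of `all_test.txt` into `urls`, collect the IDs from the aligned output file into `output_ids`, and inspect the two returned sets along with `find_duplicates(output_ids)`.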