[Data] M0 aligned document count is much lower than the original CNN/DM document count
La-SilverLand opened this issue · 3 comments
La-SilverLand commented
The original CNN/DM dataset has 92579 + 219506 = 312,085 documents in total.
But the data for M0 (lead-3) has only 11490 aligned documents, with 38 duplicated ids, which is far fewer than the original data.
Can you explain why?
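For anyone reproducing these counts, here is a minimal sketch of one way to tally records, unique ids, and repeated ids. It assumes the ids have already been read into a list; `summarize_ids` is a hypothetical helper, not part of the repository, and the toy ids are made up.

```python
from collections import Counter


def summarize_ids(ids):
    """Return (total records, unique ids, ids that occur more than once)."""
    counts = Counter(ids)
    total = sum(counts.values())
    # An id counts as duplicated if it appears at least twice.
    duplicated = sum(1 for c in counts.values() if c > 1)
    return total, len(counts), duplicated


# Toy input; in practice the ids would come from the aligned output file.
sample_ids = ["a", "b", "b", "c", "c", "c"]
print(summarize_ids(sample_ids))  # (6, 3, 2)
```

Whether "duplicated ids" means ids that repeat or extra rows beyond the first occurrence changes the number, so it is worth stating which one is being reported.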
Alex-Fabbri commented
Hi @La-SilverLand!
The model outputs correspond to the test split, consisting of 11490 examples, and not the entire dataset.
La-SilverLand commented
Does this test split use the same URL list as https://github.com/abisee/cnn-dailymail/tree/master/url_lists/all_test.txt ?
Or did you randomly sample test cases to get 11490 in total?
Alex-Fabbri commented
Hi @La-SilverLand!
This is the standard test set, so yes, it uses the URLs in that list.
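A rough way to check this locally is to hash each URL in the list and compare against the ids in the model outputs. The sketch below assumes that story files are named by the SHA-1 hex digest of the URL (as in the preprocessing script of the linked repository) and that each output id ends with that digest after the last `-`; both the id layout and the helper names are assumptions for illustration.

```python
import hashlib


def url_to_hash(url):
    """SHA-1 hex digest of a URL (assumed to match the story file name)."""
    return hashlib.sha1(url.strip().encode("utf-8")).hexdigest()


def missing_from_test_list(urls, output_ids):
    """Return the output ids whose trailing hash is not in the URL list."""
    expected = {url_to_hash(u) for u in urls if u.strip()}
    # Assumption: an id looks like "<prefix>-<sha1 hex digest>".
    return sorted(i for i in output_ids if i.rsplit("-", 1)[-1] not in expected)


# Toy usage with a made-up URL and made-up ids.
urls = ["http://example.com/story-1"]
ids = ["dm-test-" + url_to_hash(urls[0]), "dm-test-notarealhash"]
print(missing_from_test_list(urls, ids))
```

An empty result for the real files would be consistent with the outputs covering only URLs from the test list; a non-empty one may just mean the id format differs from what is assumed here.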