Common samples across train/test splits in NYT dataset?
DushyantaDhyani commented
Are the train/test splits in the repo the latest ones? I saw document IDs that are shared across the splits.
grep -Fxf idnewnyt_train.json idnewnyt_test.json
gives
Nytimes/2005/1671581.xml
Nytimes/1989/0305839.xml
Nytimes/1996/0854262.xml
Nytimes/1999/1093505.xml
Nytimes/2003/1456362.xml
Nytimes/1991/0486652.xml
Nytimes/1999/1121980.xml
Nytimes/1999/1121980.xml
Nytimes/1992/0567045.xml
Nytimes/1991/0465614.xml
Nytimes/2003/1456362.xml
Nytimes/1989/0305839.xml
Nytimes/1996/0854262.xml
Nytimes/1999/1115706.xml
Nytimes/1989/0305839.xml
Nytimes/2001/1348476.xml
Nytimes/2003/1456362.xml
Nytimes/2005/1671581.xml
Nytimes/2002/1364943.xml
Nytimes/1999/1120579.xml
Nytimes/1996/0854262.xml
Nytimes/2000/1202113.xml
Nytimes/1989/0235478.xml
Nytimes/1993/0617185.xml
Nytimes/1995/0768783.xml
Nytimes/1994/0682119.xml
Nytimes/1995/0780235.xml
Nytimes/1999/1090992.xml
Similarly,
grep -Fxf idnewnyt_train.json idnewnyt_val.json
gives
Nytimes/1995/0804632.xml
Nytimes/1992/0513267.xml
Nytimes/2000/1221219.xml
Nytimes/1995/0780235.xml
Nytimes/1997/0956260.xml
Nytimes/2005/1671581.xml
Nytimes/2005/1677600.xml
Nytimes/1988/0167030.xml
Nytimes/2003/1456362.xml
Nytimes/2000/1203815.xml
Nytimes/1999/1120180.xml
Nytimes/1989/0305839.xml
Nytimes/1995/0794787.xml
Nytimes/1989/0305839.xml
Nytimes/2000/1226769.xml
Nytimes/1999/1120161.xml
Nytimes/1997/0903961.xml
Nytimes/2000/1227135.xml
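Since `grep -Fxf A B` prints matching lines from `B`, an ID that shows up several times in the output above is repeated inside the test or val file itself. A minimal sketch for separating the two effects, assuming each file holds one document ID per line (as the grep output suggests); the helper names `split_overlap` and `dup_ids` are made up for illustration and are not part of the repo:

```shell
# Hypothetical helper: print each ID present in both files, once.
# Assumes one document ID per line in each file.
split_overlap() {
  # -F: fixed strings, -x: whole-line match, -f: read patterns from "$1".
  # sort -u collapses IDs that occur more than once in "$2".
  grep -Fxf "$1" "$2" | sort -u
}

# Hypothetical helper: print IDs that occur more than once within one file.
dup_ids() {
  # uniq -d only sees adjacent repeats, so the input is sorted first.
  sort "$1" | uniq -d
}
```

For example, `split_overlap idnewnyt_train.json idnewnyt_test.json | wc -l` would count the distinct shared IDs, and `dup_ids idnewnyt_test.json` would list IDs repeated within the test split.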