About differences between collected_data and tokenized_data.
ichiroex opened this issue · 1 comment
Thank you for sharing your interesting dataset with us.
I'm curious about the differences between collected_data and tokenized_data.
How did you process collected_data to generate tokenized_data?
I first tried to split collected_data into train/val/test using train_id.json, val_id.json, and test_id.json in the data folder.
However, the number of examples in each split differs from the train/val/test counts reported in your paper, as shown below.
[In my case]
train: 92,585
val: 12,851
test: 12,839
[In your paper]
train: 92,283
val: 12,792
test: 12,779
However, I found that the number of train/val/test examples in the tokenized_data folder matches your paper.
Did you apply any filtering process to collected_data? For reference, I rebuilt the splits roughly as in the sketch below.
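A minimal sketch of the split I attempted. The id file names match the data folder; the collected_data file name `all_data.json` and the `table_id -> [statements, labels, caption]` structure are my assumptions for illustration:

```python
import json

def load_ids(path):
    # The id files are assumed to hold a JSON list of table ids.
    with open(path) as f:
        return set(json.load(f))

train_ids = load_ids("data/train_id.json")
val_ids = load_ids("data/val_id.json")
test_ids = load_ids("data/test_id.json")

# "collected_data/all_data.json" is a placeholder file name; the assumed
# structure is table_id -> [statements, labels, caption].
with open("collected_data/all_data.json") as f:
    collected = json.load(f)

splits = {"train": {}, "val": {}, "test": {}}
for table_id, entry in collected.items():
    if table_id in train_ids:
        splits["train"][table_id] = entry
    elif table_id in val_ids:
        splits["val"][table_id] = entry
    elif table_id in test_ids:
        splits["test"][table_id] = entry

# Count statements per split (the paper reports statement counts,
# not table counts), assuming entry[0] holds the statement list.
for name, data in splits.items():
    total = sum(len(entry[0]) for entry in data.values())
    print(name, total)
```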
Hi, a few sentences are filtered out of collected_data; please refer to this line:
Table-Fact-Checking/code/preprocess_data.py, Line 524 in 948b556
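In sketch form, the filtering has this shape; the actual criterion lives at the line above, so `keep_statement` below is only a placeholder, not the repo's real rule:

```python
def keep_statement(sent):
    # Placeholder predicate; see the referenced line in
    # preprocess_data.py for the actual filtering criterion.
    return len(sent.strip().split()) > 0

def filter_entries(data):
    # Drop statements that fail the predicate, keeping labels aligned.
    # Assumes data maps table_id -> [statements, labels, caption].
    out = {}
    for table_id, (statements, labels, caption) in data.items():
        kept = [(s, l) for s, l in zip(statements, labels)
                if keep_statement(s)]
        if kept:
            new_statements, new_labels = map(list, zip(*kept))
            out[table_id] = [new_statements, new_labels, caption]
    return out
```

This kind of per-statement filtering explains why the tokenized_data counts are slightly smaller than a raw split of collected_data by the id files.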