[WT-103] Valid and test set sizes do not match the original dataset
DavidHerel commented
Hi,
When I run your code on WT-103, I get the following token counts for the valid and test sets:
valid: 216 609
test: 244 623
but the sizes of the valid and test sets in the WT-103 dataset are:
valid: 217 646
test: 245 569
(source)
I think the problem is that there are not enough 'eos' tokens. Perhaps they are not added after each line, which would explain the difference.
Is there a way to fix this and get the correct number of tokens for the valid and test sets?
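For reference, the gaps are 1,037 tokens on valid and 946 on test. One way to check the 'eos' hypothesis is to count tokens with and without one 'eos' per line and compare the results against those gaps. The snippet below is only a minimal sketch: it assumes whitespace tokenization, and `count_tokens` is a hypothetical helper, not a function from this repository. Whether blank lines should also receive an 'eos' is an assumption that may differ between preprocessing pipelines.

```python
import io

# Token counts reported in this issue.
reported = {"valid": 216_609, "test": 244_623}
expected = {"valid": 217_646, "test": 245_569}
gaps = {split: expected[split] - reported[split] for split in reported}


def count_tokens(lines, add_eos=True):
    """Count whitespace-separated tokens (hypothetical helper).

    If add_eos is True, one extra 'eos' token is counted for every line,
    including blank lines (an assumption about the preprocessing).
    """
    total = 0
    for line in lines:
        total += len(line.split())
        if add_eos:
            total += 1
    return total


# Tiny in-memory stand-in for a dataset file, for illustration only.
sample_lines = io.StringIO(" = Title = \n \n some text here . \n").readlines()
without_eos = count_tokens(sample_lines, add_eos=False)
with_eos = count_tokens(sample_lines, add_eos=True)

print(gaps)
print(without_eos, with_eos)
```

On a real file, the difference between the two counts equals the number of lines read, so it can be compared directly with the gaps above.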
Thank you,
David