[WT-103] Valid and test sets do not match the original dataset

Question

Opened this issue a year ago · 0 comments

Hi,

When I run your code on WT-103, I get the following token counts for the valid and test sets:
valid: 216 609
test: 244 623

but the sizes of the valid and test sets in the WT-103 dataset are:
valid: 217 646
test: 245 569
(source)

I think the problem is that there are not enough 'eos' tokens. Perhaps one is not appended after every line, which would account for the difference (1,037 tokens on valid and 946 on test).
Is there a way to fix this so that the valid and test sets have the correct number of tokens?
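
To make the suspicion concrete, here is a minimal sketch of the kind of count I have in mind. It is not taken from the repository: the function name `count_tokens`, its parameters, and the sample lines are hypothetical, and it assumes tokens are separated by whitespace. It also computes the gap between the reported and observed sizes quoted above.

```python
def count_tokens(lines, add_eos=True, eos_on_blank_lines=False):
    """Count whitespace-separated tokens, optionally adding one 'eos' per line.

    Hypothetical helper for illustration only; the repository's own
    preprocessing may differ.
    """
    total = 0
    for line in lines:
        tokens = line.split()
        total += len(tokens)
        # Add one 'eos' for the line; blank lines only count if requested.
        if add_eos and (tokens or eos_on_blank_lines):
            total += 1
    return total


# Sizes quoted in this issue: reported dataset sizes vs. what the code produced.
reported = {"valid": 217_646, "test": 245_569}
observed = {"valid": 216_609, "test": 244_623}
gaps = {name: reported[name] - observed[name] for name in reported}

# Tiny made-up sample in a WikiText-like layout (heading, blank line, text, blank line).
sample = [" = Title = ", "", " Some text here . ", ""]

if __name__ == "__main__":
    print("missing tokens:", gaps)
    print("no eos:            ", count_tokens(sample, add_eos=False))
    print("eos on text lines: ", count_tokens(sample))
    print("eos on every line: ", count_tokens(sample, eos_on_blank_lines=True))
```

Running the same three counts on the real valid and test files, and comparing them against the reported sizes, should show whether the missing tokens correspond to lines that receive no 'eos'.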

Thank you,
David