Validation set for pre-training UmBERTo with fairseq
go-inoue opened this issue · 2 comments
Hi,
I have a question on the dataset used to pretrain UmBERTo.
When you pre-trained the model using fairseq, I believe you had to specify a validation set in fairseq-preprocess to get the training script (fairseq-train) to run, otherwise fairseq-train gives an error (FileNotFoundError: Dataset not found: valid).
What data did you use for the --validpref argument? Compared to the training data (~7GB Wikipedia and ~69GB Commoncrawl), how big is it?
Hi @go-inoue,
Sorry for the late response.
Our split is 70% for training, 15% for validation, and 15% for test.
The split was done at the text level; each file was then converted to .bpe and fed to training:
--trainpref train.bpe
--validpref valid.bpe
--testpref test.bpe
To give you an idea: for the Commoncrawl dataset, train.bpe is 74 GB, and valid.bpe and test.bpe are 14 GB each.
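For anyone who wants to reproduce a similar 70/15/15 text-level split, here is a minimal sketch. It is not the script used for UmBERTo: the thread does not say whether the split was random or contiguous, nor what counts as one "text", so this assumes one text per line and a seeded random assignment. The function names and the output file names (`<out_prefix>.train.txt` and so on) are hypothetical. It streams the input, so it should also work for files that do not fit in memory, at the cost of hitting the target fractions only approximately.

```python
import random
from typing import Dict, Iterable, Iterator, Tuple


def assign_splits(
    texts: Iterable[str],
    train_frac: float = 0.70,
    valid_frac: float = 0.15,
    seed: int = 0,
) -> Iterator[Tuple[str, str]]:
    """Yield (split_name, text) pairs; whatever is left after train and valid goes to test."""
    rng = random.Random(seed)  # seeded so the split is reproducible
    for text in texts:
        r = rng.random()
        if r < train_frac:
            yield "train", text
        elif r < train_frac + valid_frac:
            yield "valid", text
        else:
            yield "test", text


def write_splits(input_path: str, out_prefix: str, **kwargs) -> Dict[str, int]:
    """Stream input_path (assumed: one text per line) into three split files.

    Returns the number of texts written to each split.
    """
    names = ("train", "valid", "test")
    counts = {name: 0 for name in names}
    files = {
        name: open(f"{out_prefix}.{name}.txt", "w", encoding="utf-8")
        for name in names
    }
    try:
        with open(input_path, encoding="utf-8") as src:
            for name, line in assign_splits(src, **kwargs):
                files[name].write(line)
                counts[name] += 1
    finally:
        for handle in files.values():
            handle.close()
    return counts
```

Each of the three text files would then be BPE-encoded separately (giving train.bpe, valid.bpe and test.bpe) before being passed to --trainpref, --validpref and --testpref as above.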
Hi @simonefrancia,
Thanks! I'm closing this issue.