musixmatchresearch/umberto

Validation set for pre-training UmBERTo with fairseq

go-inoue opened this issue · 2 comments

Hi,

I have a question about the dataset used to pretrain UmBERTo.

When you pre-trained the model with fairseq, I believe you had to specify a validation set in fairseq-preprocess for the training script (fairseq-train) to run; otherwise fairseq-train fails with an error (FileNotFoundError: Dataset not found: valid).

What data did you use for the --validpref argument? How big is it compared to the training data (~7GB Wikipedia and ~69GB Commoncrawl)?

Hi @go-inoue,
Sorry for the late response.
Our split is 70% for training, 15% for validation and 15% for test.
The split was done at text level; each resulting file was then converted to .bpe and passed to fairseq-preprocess:

 --trainpref train.bpe 
 --validpref valid.bpe 
 --testpref test.bpe 

To give you an idea: for the commoncrawl dataset, train.bpe is 74 GB, and valid.bpe and test.bpe are 14 GB each.
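
For readers who want to reproduce a similar split, here is a minimal sketch of a 70/15/15 split done on raw text before BPE encoding. The thread does not say how the split was drawn, so this sketch makes two assumptions: one text unit per line, and random assignment per line. The function name `split_lines` and its parameters are hypothetical and are not part of the UmBERTo code.

```python
import random
from typing import Dict, Iterable, List


def split_lines(
    lines: Iterable[str],
    train_frac: float = 0.70,
    valid_frac: float = 0.15,
    seed: int = 0,
) -> Dict[str, List[str]]:
    """Randomly assign each line to train, valid or test.

    Assumption: one text unit per line. The remaining fraction
    (1 - train_frac - valid_frac) goes to the test split.
    """
    if not 0.0 <= train_frac + valid_frac <= 1.0:
        raise ValueError("train_frac + valid_frac must be in [0, 1]")

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    splits: Dict[str, List[str]] = {"train": [], "valid": [], "test": []}

    for line in lines:
        r = rng.random()  # uniform in [0, 1)
        if r < train_frac:
            splits["train"].append(line)
        elif r < train_frac + valid_frac:
            splits["valid"].append(line)
        else:
            splits["test"].append(line)
    return splits


if __name__ == "__main__":
    # Toy corpus standing in for the raw text files.
    corpus = [f"sentence {i}" for i in range(1000)]
    parts = split_lines(corpus)
    for name, chunk in parts.items():
        print(name, len(chunk))
```

Because each line is assigned independently, the split sizes are only approximately 70/15/15. For a corpus of this size you would likely stream the lines to three output files rather than hold them in memory, then BPE-encode each file separately.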

Hi @simonefrancia,

Thanks! I'm closing this issue.