Training a Russian RoBERTa model using:
- YouTokenToMe as the tokenizer
- Fairseq toolkit
- PyTorch
- TensorBoard for training visualisation
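All of the dependencies are available from PyPI. A minimal install sketch, assuming the standard package names (exact versions are not pinned here and may need adjusting):
$ pip install torch fairseq youtokentome tensorboard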
- Train a tokenizer model and split the data into train/valid/test sets (change the paths if needed):
$ python3 ./scripts/run_pretraining.py
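run_pretraining.py handles both the tokenizer training and the data split. For reference, a BPE model can also be trained directly with the YouTokenToMe CLI; the corpus path, model path, and vocabulary size below are illustrative placeholders, not the values used by the script:
# train a BPE tokenizer on the raw corpus (plain text, one sentence per line)
$ yttm bpe --data ./data/corpus.txt --model ./data/bpe.model --vocab_size 50000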
- Encode and binarize the data:
$ ./scripts/run_encoding.sh
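run_encoding.sh applies the trained BPE model to each split and binarizes the result for Fairseq. A rough equivalent with the YouTokenToMe and Fairseq command-line tools (all paths, the dictionary file, and the worker count are placeholders):
# encode raw text into BPE subwords with the trained tokenizer
$ yttm encode --model ./data/bpe.model --output_type subword < ./data/train.txt > ./data/train.bpe
# binarize the encoded splits into a Fairseq data directory
$ fairseq-preprocess --only-source \
    --srcdict ./data/dict.txt \
    --trainpref ./data/train.bpe --validpref ./data/valid.bpe --testpref ./data/test.bpe \
    --destdir ./data-bin/ruberta --workers 16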
- Start training:
$ ./scripts/run_train_16.sh
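run_train_16.sh wraps the Fairseq training call. A sketch of a typical RoBERTa-base masked-LM pretraining command; the hyperparameters and paths are illustrative placeholders, not the exact settings behind the checkpoint reported below:
# pretrain RoBERTa-base with the masked LM objective on the binarized data
$ fairseq-train ./data-bin/ruberta \
    --task masked_lm --criterion masked_lm --arch roberta_base \
    --sample-break-mode complete --tokens-per-sample 512 \
    --optimizer adam --adam-betas '(0.9, 0.98)' --adam-eps 1e-6 --clip-norm 0.0 \
    --lr-scheduler polynomial_decay --lr 0.0005 --warmup-updates 10000 --total-num-update 65000 \
    --dropout 0.1 --attention-dropout 0.1 --weight-decay 0.01 \
    --batch-size 16 --update-freq 16 --max-update 65000 \
    --fp16 --tensorboard-logdir ./logs/ruberta
Passing --tensorboard-logdir lets the training curves be followed with tensorboard --logdir ./logs/ruberta.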
A model trained on the Russian Wikipedia + Taiga corpus:
RuBERTa-base, batch 264, 65k steps
F1 score on the Sber SQuAD dataset: 78.60