Russian RoBERTa

Code for training a Russian RoBERTa model using PyTorch and YouTokenToMe.

Requirements

  • pytorch
  • youtokentome
  • tensorboard for training visualisation

Training

  1. Train a tokenizer model and split the data into train/valid/test sets (change paths if needed):
$ python3 ./scripts/run_pretraining.py
  2. Encode and binarize the data:
$ ./scripts/run_encoding.sh
  3. Start training:
$ ./scripts/run_train_16.sh
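The train/valid/test split in step 1 can be sketched in plain Python as follows. This is a minimal illustration, not the repo's actual code: the function name `split_corpus` and the 90/5/5 ratio are assumptions, and `run_pretraining.py` may use different defaults.

```python
import random

def split_corpus(lines, valid_frac=0.05, test_frac=0.05, seed=42):
    """Shuffle corpus lines and split them into train/valid/test portions.

    The 90/5/5 split and fixed seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    lines = list(lines)
    rng.shuffle(lines)
    n_valid = int(len(lines) * valid_frac)
    n_test = int(len(lines) * test_frac)
    valid = lines[:n_valid]
    test = lines[n_valid:n_valid + n_test]
    train = lines[n_valid + n_test:]
    return train, valid, test

# Example usage with a toy corpus of 100 sentences:
corpus = [f"sentence {i}" for i in range(100)]
train, valid, test = split_corpus(corpus)
```

Each split can then be written to its own file before the encoding step.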

Pretrained models

A model trained on the Russian Wikipedia + Taiga corpus:

RuBERTa-base, batch size 264, 65k steps

F1 score on Sber SQuAD dataset: 78.60