dbmdz/berts

Domain of the corpora used for pretraining


Hello!

First of all, thanks for this wonderful collection of pretrained models. I am wondering what the domain of the corpora used to pretrain BERTurk, DistilBERTurk, ConvBERTurk, and ELECTRA is. I would like to cite these models in a scientific publication and give readers an idea of the domain knowledge made available to the models during pretraining.

Hi @ehalit,

thanks for using our Turkish models 🤗
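As a side note, all of these models can be loaded directly from the Hugging Face Hub. Here is a minimal sketch, assuming the Hub model ID `dbmdz/bert-base-turkish-cased` for BERTurk (the other models have analogous IDs; please verify the exact names on the Hub before use):

```python
# Minimal sketch: load a Turkish model from the Hugging Face Hub.
# The model ID below is an assumption for illustration; check the Hub for the exact name.
from transformers import AutoTokenizer, AutoModel

model_id = "dbmdz/bert-base-turkish-cased"  # BERTurk; swap in DistilBERTurk/ConvBERTurk/ELECTRA IDs as needed

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Encode a short Turkish sentence and run a forward pass.
inputs = tokenizer("Merhaba dünya!", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```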

A detailed overview of the different corpus sizes used for pretraining can be found here:

https://github.com/stefan-it/turkish-bert#turkish-model-zoo

The 35GB corpus consists of the Turkish OSCAR corpus (first version), a recent Wikipedia dump, various OPUS corpora, and a special corpus provided by Kemal Oflazer.

The 242GB version comes from the multilingual mC4 corpus: allenai/allennlp#5265.

I'm currently preparing a new BERTurk model that is trained on the recently updated Turkish part of the OSCAR corpus.

All information about the pretraining corpora can also be found in that repo; at the end of the README you will also find a BibTeX entry 🤗

Thanks for your reply.