Turkish Language Models

Introduction

In this repository, we publish Transformer-based Turkish language models and related tools.

New models, datasets, updates and tutorials are on the way. Please keep in touch with us.

For any question or request, feel free to open an issue. We are an enthusiastic R&D team working to contribute to the Turkish NLP community, and we need your feedback.

You can also check out our Zemberek-Python implementation. It is fully integrated with Python and requires neither Java nor a JVM.

Updates

August 28, 2020: Initial version of this repo.

December 8, 2020: New models have been trained with a different corpus and different training parameters. Fine-tuning scores have been updated with optimal hyperparameters.

  • The cased BERT model is newly added.
  • The uncased BERT model is updated.

Language Models

ELECTRA, ALBERT, BERT

Trained Models

| Model | Vocabulary Size (thousands) | Raw Training Text Size (GB) | Training Steps (millions) |
| --- | --- | --- | --- |
| loodos/electra-small-turkish-cased-discriminator | 32 | 200 | 1 |
| loodos/electra-small-turkish-uncased-discriminator | 32 | 40 | 1 |
| loodos/electra-base-turkish-uncased-discriminator | 32 | 40 | 1 |
| loodos/electra-base-turkish-64k-uncased-discriminator | 64 | 200 | 1 |
| loodos/bert-base-turkish-uncased | 32 | 40 | 5 |
| loodos/bert-base-turkish-cased | 32 | 40 | 5 |
| loodos/albert-base-turkish-uncased | 32 | 40 | 3 |
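
All of the checkpoints above are published on the Huggingface model hub under the loodos organization and can be loaded with Transformers. A minimal loading sketch (the BERT checkpoint and the sample sentence are only illustrative; for the uncased models, also see the normalization note under Pretraining Details):

```python
# A minimal sketch: loading one of the pretrained checkpoints above with
# Huggingface Transformers. The sample sentence is only illustrative; for the
# uncased checkpoints, apply the TextNormalization step described under
# Pretraining Details before tokenizing.
from transformers import AutoModel, AutoTokenizer

model_name = "loodos/bert-base-turkish-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode a sample sentence and run it through the encoder.
inputs = tokenizer("Merhaba dünya!", return_tensors="pt")
last_hidden_state = model(**inputs)[0]
print(last_hidden_state.shape)  # (1, sequence_length, 768) for the base models
```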

Pretraining Details

  • Our training dataset (200 GB of raw Turkish text in total) was collected from online blogs, free e-books, newspapers, Common Crawl corpora, Twitter, articles, Wikipedia and so on.

  • We are not able to share the entire dataset due to copyright issues, but we plan to share part of it. Check our website frequently.

  • You can use the filtered Common Crawl OSCAR corpus from here.

  • During pretraining and fine-tuning, we found a normalization issue specific to Turkish in both Google's and Huggingface's repos. You can check the details here. We are waiting for Huggingface to resolve this issue. Until it is fixed, we strongly recommend using our TextNormalization module before tokenizing (see the short illustration after this list).
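
One well-known Turkish-specific pitfall of this kind is case folding of the dotted and dotless i, which is easy to reproduce with plain Python: Unicode's default lowercasing does not follow Turkish casing rules, so naive lowercasing before an uncased tokenizer can produce text the model never saw in pretraining. A small illustration (the turkish_lower helper below is only a sketch of the idea, not our TextNormalization module):

```python
# Illustration of a Turkish-specific casing pitfall: Unicode's default
# lowercasing does not follow Turkish rules for "I" and "İ".
#   - Turkish "I" should lowercase to dotless "ı", but str.lower() gives "i".
#   - Turkish "İ" should lowercase to "i", but str.lower() gives "i" + U+0307
#     (a combining dot), which even changes the string length.
print("ISPARTA".lower())                          # 'isparta' (Turkish expects 'ısparta')
print("İstanbul".lower())                         # contains a combining dot above the i
print(len("İstanbul"), len("İstanbul".lower()))   # 8 9

# A sketch of a Turkish-aware lowering step, applied before the tokenizer runs
# (not our TextNormalization module, just the underlying idea):
def turkish_lower(text: str) -> str:
    return text.replace("İ", "i").replace("I", "ı").lower()

print(turkish_lower("ISPARTA"))   # 'ısparta'
print(turkish_lower("İstanbul"))  # 'istanbul'
```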

Finetune Results

Results were recorded on our workstation, Pantheon.

System Specs:

  • AMD Ryzen 9 3950X
  • 2 x NVIDIA Titan RTX
  • 128 GB RAM
  • 4 TB SSD and 30 TB HDD
  • MSI X570 Godlike motherboard
  • Ubuntu 18.04, CUDA 10.1, Python 3.8.6, TensorFlow 2.3.1, PyTorch 1.7.0, Transformers 3.4.0

DBMDZ models are also known as BERTurk.

NER

Huggingface's token classification example is used for scoring.

Our training dataset is the same dataset used in: Mustafa Keskin, Banu Diri, "Otomatik Veri Etiketleme ile Varlık İsmi Tanıma" ("Named Entity Recognition with Automatic Data Labeling"), 4th International Mediterranean Science and Engineering Congress (IMSEC 2019), 322-326.

You can contact Prof. Dr. Banu Diri to obtain the dataset and further details of the conference paper.

Hyperparameters (others are default):

task_type: NER
max_seq_len: 512
learning_rate: 5e-5
num_train_epochs: 5
seed: 1
per_gpu_train_batch_size: depends on model

| Model | F1 Score |
| --- | --- |
| loodos/electra-small-turkish-cased-discriminator | 0.63 |
| loodos/electra-small-turkish-uncased-discriminator | 0.81 |
| loodos/electra-base-turkish-uncased-discriminator | 0.87 |
| loodos/electra-base-turkish-64k-uncased-discriminator | 0.86 |
| loodos/bert-base-turkish-uncased | 0.89 |
| loodos/bert-base-turkish-cased | 0.90 |
| loodos/albert-base-turkish-uncased | 0.85 |
| dbmdz/bert-base-turkish-128k-cased | 0.89 |
| dbmdz/bert-base-turkish-128k-uncased | 0.80 |
| dbmdz/bert-base-turkish-uncased | 0.89 |
| dbmdz/bert-base-turkish-cased | 0.90 |
| dbmdz/electra-base-turkish-cased-discriminator | 0.89 |
| dbmdz/electra-small-turkish-cased-discriminator | 0.79 |
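
For reference, the hyperparameters above map onto Huggingface's Trainer API roughly as in the sketch below. The label list, batch size and the commented-out dataset objects are placeholders; the actual scores were produced with the stock token-classification example script mentioned above.

```python
# A minimal sketch of the NER fine-tuning setup with the hyperparameters listed
# above. The label list, batch size and dataset objects are placeholders.
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]  # assumed label set

model_name = "loodos/bert-base-turkish-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name, num_labels=len(label_list)
)

args = TrainingArguments(
    output_dir="ner-bert-base-turkish-cased",
    learning_rate=5e-5,
    num_train_epochs=5,
    seed=1,
    per_device_train_batch_size=16,  # "per_gpu_train_batch_size: depends on model"
)
# Note: max_seq_len = 512 is applied when encoding the dataset, not here.

# train_dataset / eval_dataset would be token-classification datasets encoded
# with the tokenizer above (max_length=512).
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_dataset, eval_dataset=eval_dataset)
# trainer.train()
```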

QA

Huggingface's question-answering example is used for scoring.

The training dataset is TQuad.

  • We modified it slightly to match the SQuAD v2 format and removed some answers with a misleading "answer_start" value (see the sketch below).
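
For clarity, the relevant part of the SQuAD v2 layout is sketched below as a Python dict with illustrative values; "answer_start" is the character offset of the answer inside the context, which is the field that had to be corrected or dropped.

```python
# A sketch of a single SQuAD v2-style QA entry (illustrative values).
# "answer_start" must be the exact character offset of the answer text inside
# the context; entries where this offset was misleading were removed.
example = {
    "context": "Türkiye'nin başkenti Ankara'dır.",
    "qas": [
        {
            "id": "tquad-0001",       # illustrative id
            "question": "Türkiye'nin başkenti neresidir?",
            "is_impossible": False,   # SQuAD v2 also allows unanswerable questions
            "answers": [{"text": "Ankara", "answer_start": 21}],
        }
    ],
}

# Sanity check that the offset really points at the answer span:
ans = example["qas"][0]["answers"][0]
assert example["context"][ans["answer_start"]:].startswith(ans["text"])
```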

Hyperparameters (others are default):

max_seq_len: 512
doc_stride: 128
learning_rate: 3e-5
num_train_epochs: 5
per_gpu_train_batch_size: depends on model

| Model | Exact Match | F1 Score |
| --- | --- | --- |
| loodos/electra-small-turkish-cased-discriminator | 28.57 | 47.00 |
| loodos/electra-small-turkish-uncased-discriminator | 36.09 | 57.68 |
| loodos/electra-base-turkish-uncased-discriminator | 54.82 | 73.75 |
| loodos/electra-base-turkish-64k-uncased-discriminator | 54.70 | 75.39 |
| loodos/bert-base-turkish-uncased | 58.40 | 75.86 |
| loodos/bert-base-turkish-cased | 58.29 | 76.37 |
| loodos/albert-base-turkish-uncased | 46.63 | 66.31 |
| dbmdz/bert-base-turkish-128k-uncased | 59.41 | 77.50 |
| dbmdz/bert-base-turkish-128k-cased | 60.53 | 77.49 |
| dbmdz/bert-base-turkish-uncased | 59.75 | 76.48 |
| dbmdz/bert-base-turkish-cased | 58.40 | 76.19 |
| dbmdz/electra-base-turkish-cased-discriminator | 57.39 | 77.51 |
| dbmdz/electra-small-turkish-cased-discriminator | 31.61 | 53.08 |
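
Once a checkpoint has been fine-tuned as above, it can be queried through Transformers' question-answering pipeline. A minimal usage sketch (the model path and the example question/context are placeholders):

```python
# A minimal usage sketch for a QA model fine-tuned as described above.
# "path/to/finetuned-qa-model" is a placeholder, not a published checkpoint.
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="path/to/finetuned-qa-model",
    tokenizer="path/to/finetuned-qa-model",
)

result = qa(
    question="Türkiye'nin en kalabalık şehri hangisidir?",
    context="Türkiye'nin başkenti Ankara'dır ve en kalabalık şehri İstanbul'dur.",
)
print(result)  # e.g. {'answer': 'İstanbul', 'score': ..., 'start': ..., 'end': ...}
```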

Sentiment Analysis

Our custom text classification project is used for scoring. It uses TFAlbertForSequenceClassification, TFBertForSequenceClassification and TFElectraForSequenceClassification from Huggingface's Transformers.

The training dataset is Sentiment Analysist. Before training, we normalized each sentence in the dataset with Zemberek for better scoring. The samples were split into 80% train and 20% test.

Hyperparameters (others are default):

max_seq_len: 192
learning_rate: 1e-6
epsilon: 1e-7
num_train_epochs: 15
per_gpu_train_batch_size: depends on model

| Model | Test Score |
| --- | --- |
| loodos/electra-small-turkish-cased-discriminator | 66.22 |
| loodos/electra-small-turkish-uncased-discriminator | 78.97 |
| loodos/electra-base-turkish-uncased-discriminator | 89.70 |
| loodos/electra-base-turkish-64k-uncased-discriminator | 88.17 |
| loodos/bert-base-turkish-uncased | 92.17 |
| loodos/bert-base-turkish-cased | 91.52 |
| loodos/albert-base-turkish-uncased | 88.59 |
| dbmdz/bert-base-turkish-128k-uncased | 91.92 |
| dbmdz/bert-base-turkish-128k-cased | 91.94 |
| dbmdz/bert-base-turkish-uncased | 91.02 |
| dbmdz/bert-base-turkish-cased | 91.67 |
| dbmdz/electra-base-turkish-cased-discriminator | 91.85 |
| dbmdz/electra-small-turkish-cased-discriminator | 83.70 |
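
For reference, a minimal sketch of the setup described above, using TFBertForSequenceClassification with the listed hyperparameters. The example sentences, labels and batch size are placeholders, and the Zemberek normalization step is omitted.

```python
# A minimal sketch of the sentiment fine-tuning setup with the classes and
# hyperparameters listed above. The texts/labels are placeholders; in the
# actual project, sentences are normalized with Zemberek before tokenization.
import tensorflow as tf
from transformers import AutoTokenizer, TFBertForSequenceClassification

model_name = "loodos/bert-base-turkish-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Add from_pt=True if only PyTorch weights are available for the checkpoint.
model = TFBertForSequenceClassification.from_pretrained(model_name, num_labels=2)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-6, epsilon=1e-7),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

# Placeholder encoding of a few sentences with 0/1 labels, padded/truncated to
# max_seq_len = 192.
texts = ["bu ürünü çok beğendim", "tam bir hayal kırıklığı"]
labels = [1, 0]
enc = tokenizer(texts, padding="max_length", truncation=True, max_length=192,
                return_tensors="tf")

# model.fit(dict(enc), tf.constant(labels), epochs=15, batch_size=16)
```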

Citation

We are preparing a paper about our models and results. Until it is published, you can cite this repository.

Special Thanks

We are thankful to the TensorFlow Research Cloud (TFRC) for providing TPUs for pretraining our models, to DBMDZ for their detailed pretraining cheatsheet, and to Prof. Dr. Banu Diri for her guidance and for providing some of the datasets.