/botcuk-dataset-analyze

BERTurk model performance results and datasets


These results and datasets are part of my dissertation, which will be published soon.

BERTurk Performance Analysis on Text Classification and Question Answering Tasks in Turkish Datasets

The datasets in this project were used to fine-tune BERTurk models for text classification and question answering tasks on the Colab platform. The results are published in this repository.

The datasets were cleaned, standardized, and split into training (70%), validation (20%), and test (10%) sets. In addition, the character and word counts of each input were calculated for use in visual analysis, and the sentence elements were extracted with the Zemberek tool and added to the datasets.
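The split and the count features described above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the code used in the dissertation: the function names `split_dataset` and `add_counts`, the fixed seed, and the whitespace-based word count are all assumptions. The Zemberek step is not shown.

```python
import random


def split_dataset(rows, seed=42):
    """Shuffle rows and split them 70/20/10 into train, validation, and test.

    The seed and the shuffling strategy are assumptions for illustration.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = len(rows) * 7 // 10  # 70% for training
    n_val = len(rows) * 2 // 10    # 20% for validation
    train = rows[:n_train]
    val = rows[n_train:n_train + n_val]
    test = rows[n_train + n_val:]  # remaining ~10% for testing
    return train, val, test


def add_counts(text):
    """Attach character and whitespace-separated word counts to one input."""
    return {
        "text": text,
        "char_count": len(text),
        "word_count": len(text.split()),
    }


# Hypothetical toy data, only to show the shapes involved.
rows = [add_counts(f"example sentence {i}") for i in range(100)]
train, val, test = split_dataset(rows)
print(len(train), len(val), len(test))
```

Integer arithmetic is used for the split sizes so that rounding never drops or duplicates a row; any leftover rows fall into the test set.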

You can find all fine-tuned models on Hugging Face.

Question Detection Datasets

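| Dataset        | Best Model   | Accuracy | Precision | Recall   | F1       |
|----------------|--------------|----------|-----------|----------|----------|
| Dialog Dataset | ConvBERTurk  | 0.958773 | 0.951311  | 0.892570 | 0.921005 |
| Quora Dataset  | ELECTRA Base | 0.959178 | 0.952355  | 0.893072 | 0.921762 |
| Tweet Dataset  | ELECTRA Base | 0.788375 | 0.790655  | 0.788375 | 0.787725 |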
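For readers unfamiliar with the four metrics reported for question detection, the sketch below shows how they are commonly derived from a confusion matrix. It assumes a binary task (1 = question, 0 = not a question) scored on the positive class; the averaging method actually used for the reported numbers is not stated here, so treat this as an illustration of the definitions rather than the exact evaluation procedure. The name `classification_metrics` and the toy labels are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical toy labels, not taken from the datasets.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```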

Question Answering Datasets

| Dataset       | Best Model   | Exact Match | F1      |
|---------------|--------------|-------------|---------|
| TQuad Dataset | ELECTRA Base | 61.5385     | 80.3351 |
| YTU Dataset   | ELECTRA Base | 65.0746     | 82.9919 |
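Exact Match and F1 for question answering are usually computed per answer and then averaged over the dataset as percentages. The sketch below follows the common SQuAD-style definitions, which is an assumption about how the scores above were obtained. The normalization is deliberately simplified (lowercasing and whitespace cleanup only), and the function names are hypothetical; Turkish-specific casing rules are not handled.

```python
from collections import Counter


def normalize(text):
    """Lowercase and collapse whitespace (simplified; no punctuation handling)."""
    return " ".join(text.lower().split())


def exact_match(prediction, reference):
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted answer and a reference answer."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical prediction/reference pairs, not taken from the datasets.
pairs = [("Ankara", "ankara"), ("in ankara", "ankara")]
em = 100 * sum(exact_match(p, r) for p, r in pairs) / len(pairs)
f1 = 100 * sum(token_f1(p, r) for p, r in pairs) / len(pairs)
print(em, f1)
```

F1 is higher than Exact Match whenever a prediction overlaps the reference without matching it exactly, which is consistent with the gap between the two columns in the table.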