This repository accompanies a bachelor's thesis on transfer learning for text data analysis, written at Vysoká škola báňská - Technická univerzita Ostrava:
https://dspace.vsb.cz/handle/10084/147359
The aim of this bachelor's thesis was to test transfer learning methods on several datasets and compare the results with simpler machine learning methods. Text analysis is a broad field, so I focused on one subfield, text classification. Experiments need data, so the thesis also includes a part dedicated to data preprocessing. The experiments cover three languages: English, French and Czech. Using several languages let me compare every method and model within each language and tell with confidence which method performs best for that language. The transformer models deserve a special mention: they perform surprisingly well even with a small training dataset, and in most cases they even outperformed deep learning methods trained on tens of thousands of training samples.
The thesis itself summarizes different machine learning techniques and architectures applied to 4 different datasets:
- IMDB
- English Twitter
- French Twitter
- CSFD (Similar to IMDB but Czech)
The techniques used are:
- Support Vector Machine - Simpler ML model for benchmarking against the deep learning models
- Recurrent neural networks (Own architecture)
- Recurrent neural networks (Using pretrained FastText embeddings)
- Transformers - BERT, camemBERT, distilBERT, roBERTa, small-e-czech, XLM-roBERTa
Frameworks used:
- Scikit-learn - Support Vector Machine
- TensorFlow, Keras - Recurrent neural networks
- PyTorch - Only for Transformers
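To give a feel for the Support Vector Machine baseline listed above, here is a minimal sketch using Scikit-learn. It assumes TF-IDF features and a linear kernel, which may differ from the setup in the notebooks, and the toy sentences and the names `train_texts`, `train_labels` and `baseline` are invented for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data invented for this example; the real experiments use
# the IMDB, Twitter and CSFD datasets.
train_texts = [
    "a wonderful and moving film",
    "great acting and a great story",
    "boring plot and terrible acting",
    "a dull and terrible film",
]
train_labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

# Assumption: TF-IDF word features fed into a linear SVM.
# The SVM notebooks may use different features or kernels.
baseline = make_pipeline(TfidfVectorizer(lowercase=True), LinearSVC())
baseline.fit(train_texts, train_labels)

# Classify two unseen snippets and print the predicted labels.
predictions = baseline.predict(["great story", "terrible plot"])
print(predictions)
```

A pipeline keeps the vectorizer and the classifier together, so the same preprocessing is applied at training and prediction time.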
Directory structure:
Folder CSFD:
- Jupyter notebook BERT_CSFD.ipynb containing all the experiments for the method using Transformers
- Jupyter notebook FastText_CSFD.ipynb containing all the experiments for the method using RNNs with FastText embeddings
- Jupyter notebook RNN_CSFD.ipynb containing all the experiments for the method using my own RNN architecture
- Jupyter notebook SVM_CSFD.ipynb containing all the experiments for the method using SVMs
- Jupyter notebook CSFD_Dataset.ipynb containing all the preprocessing steps and the description of the data
The directories for all other datasets have the same structure.
The training set sizes vary because the tables show the best result for each method and dataset. To find out which method was best on average, or what the architectures looked like, check out the Jupyter notebooks and the thesis.
Datasets can be found on my GDrive here: https://drive.google.com/drive/folders/1OmB3FlDFeXJKrOVyl0JPx_GIUN4RT0xB?usp=sharing
IMDB:

Method | Train set size | Training time (s) | Accuracy | F1-score |
---|---|---|---|---|
SVM | 25 000 | 1575 | 84.29% | 0.842 |
BERT | 25 000 | 13549 | 93.38% | 0.934 |
RNN model | 25 000 | 1363 | 86.92% | 0.874 |
FastText | 25 000 | 169 | 85.70% | 0.853 |

English Twitter:

Method | Train set size | Training time (s) | Accuracy | F1-score |
---|---|---|---|---|
SVM | 25 000 | 266 | 72.27% | 0.726 |
BERT | 20 000 | 921 | 82.40% | 0.824 |
RNN model | 900 000 | 766 | 82.58% | 0.827 |
FastText | 900 000 | 783 | 81.98% | 0.825 |

French Twitter:

Method | Train set size | Training time (s) | Accuracy | F1-score |
---|---|---|---|---|
SVM | 25 000 | 343 | 73.76% | 0.742 |
camemBERT | 40 000 | 3870 | 82.07% | 0.820 |
RNN model | 900 000 | 1026 | 81.27% | 0.815 |
FastText | 900 000 | 734 | 80.42% | 0.798 |

CSFD:

Method | Train set size | Training time (s) | Accuracy | F1-score |
---|---|---|---|---|
SVM | 25 000 | 411 | 88.70% | 0.886 |
small-e-czech | 50 613 | 855 | 89.55% | 0.898 |
RNN model | 50 613 | 561 | 90.04% | 0.902 |
FastText | 50 613 | 547 | 87.51% | 0.876 |
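
For readers unfamiliar with the two metrics reported in the tables, the following sketch shows how accuracy and F1-score can be computed for a binary classifier. It assumes the F1-score is computed for the positive class only; the averaging actually used in the notebooks may differ. The helper names `accuracy` and `binary_f1` and the label lists are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def binary_f1(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 4 positives, 4 negatives, two false positives.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 1, 1]

print(f"Accuracy: {accuracy(y_true, y_pred):.2%}")   # Accuracy: 75.00%
print(f"F1-score: {binary_f1(y_true, y_pred):.3f}")  # F1-score: 0.800
```

In this toy case the classifier finds every positive sample but also produces two false positives, so the F1-score (0.800) is higher than the accuracy (75.00%). The two metrics need not agree, which is why the tables report both.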
Since the thesis is written solely in Czech (an article is still TODO), only the results are presented here. The comments in the Jupyter notebooks, however, are written in English.
If you use any part of the thesis, please cite:
@thesis{Jochymek2022,
author = {Lukáš Jochymek},
title = {Transfer learning pro analýzu textových dat},
address = {Ostrava},
year = {2022},
school = {Vysoká škola báňská – Technická univerzita Ostrava},
type = {Bakalářská práce},
urldate = {2023-03-20},
url = {http://hdl.handle.net/10084/147359},
}