Troll or Not Troll?

Troll or Not Troll?

Intro

The field of NLP was revolutionized in the year 2018 by introduction of BERT and his Transformer friends(RoBerta, XLM etc.).

These novel transformer based neural network architectures and new ways to training a neural network on natural language data introduced transfer learning to NLP problems. Transfer learning had been giving out state of the art results in the Computer Vision domain for a few years now and introduction of transformer models for NLP brought about the same paradigm change in NLP.

Motivation

Despite these amazing technological advancements applying these solutions to business problems is still a challenge given the niche knowledge required to understand and apply these method on specific problem statements. Hence, In the experiment I will be testing a few transformer architecture, on a relatively big dataset, and how it performs on shorter text with a humongous corpus.

Before i proceed i would like to mention the following groups for the fantastic work they are doing and sharing which have made these experiments possible.

Please review these amazing sources of information:

Approach

I'll be looking at a few models, and compare it with the standard models along the way. Then all of the models will be evaluated with accuracy_score and matthews_corrcoef. The models that are tested are:

distilbert-base-multilingual
bert-base-multilingual
roberta-base
electra-small
electra-base

Methods

preprocessing

the dataset below has been preprocessed by removing:

punctuations
non alphanum character (numbers excluded) see this script

Dataset Exploration

all the these methods are inside the [this notebook][notebook/eda.ipynb] with the total of 89k tweets, which are splitted in to train and test, and 2 classes, we can see the distribution between the classes are balanced. whereas class 0 has 44939 tweets, and class 1 has 45009 tweets

word length

by eyeballing on the length of each tweet, we can see that the tweet legnth on each class has a distinguishable differences.

hyperparameters

Results and conclusion

All the training performance, losses and metrics are recorded to wandb. Results are here if you prefer to see it for yourself.

accuracy score

baseline metric for prediction, by comparing its true label and the prediction result

mcc

Matthews correlation coeffienct takes account true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes.

The MCC is in essence a correlation coefficient value between -1 and +1. A coefficient of +1 represents a perfect prediction, 0 an average random prediction and -1 an inverse prediction. The statistic is also known as the phi coefficient.

training time

as you can see roberta, takes the longest to train (~ 9 hours) while bert and electra are way faster (~1 hour).

conclusion

in this experiment we can see that roberta-base almost outpeforms every other model although ELECTRA came pretty close despite almost half the parameters trained. while the GLUE Benchmark results has electra-base as the state of the art, it took a while to find a good hyperparameters by using a good annealing factor. roberta-base gain advantage on shorter text, with a big number of corpus. Meanwhile electra-base came in pretty close in the second place, with a good hyperparameter optimization, it could surpass roberta-base for sure. While roberta-base excels in this kind of task, roberta-base is really resource intensive compared to electra-base

svmihar/transformers_state_trolls_cch