Paper evaluating BERT, RoBERTa, DistilBERT, ALBERT and XLNet for detecting the stances of fake news.
In this paper, the two datasets FNC-1 and FNC-1 ARC are used to finetune large pretrained NLP models to classify the stance of an article body towards its headline.
The goal is to systematically analyze the following questions:
- How well do the models perform in general?
- How much hyperparameter tuning is necessary?
- Which of the models performs best?
The background of the paper is the Fake News Challenge, which was held in 2017. More details can be found on the official Fake News Challenge website.
In total, two datasets are used to finetune the five models. The first dataset comes from the Fake News Challenge itself, while the second dataset is an extension created by Hanselowski et al. Both datasets consist of article bodies, headlines and class labels. The class label expresses the stance of the article body towards the headline: the article body can Agree (AGR) or Disagree (DSG) with the headline, Discuss (DSC) it, or be completely Unrelated (UNR) to it.
Dataset | Data Source | Data Type | Instances | AGR | DSG | DSC | UNR |
---|---|---|---|---|---|---|---|
FNC-1 | Fake News Challenge Stage 1 | News articles | 49,972 | 7.4% | 1.7% | 17.8% | 73.1% |
FNC-1 ARC | Retrospective analysis of the challenge (Hanselowski et al.) | News articles + user posts | 64,205 | 7.7% | 3.5% | 15.3% | 73.5% |
Before finetuning, both datasets are pre-processed as follows:

Step | Details |
---|---|
Concatenation | Headline + Article body |
Stop word removal | The, the, A, a, An, an |
Train-dev split | 80:20 |
In total, five models are examined; their HuggingFace implementations are used.
Model | Publication Date | Published By | Idea in a Nutshell
---|---|---|---
BERT | Oct 2018 | Google AI Language | Bidirectional encoder representations from Transformers
RoBERTa | Jul 2019 | Facebook AI & University of Washington | Pretrain BERT more extensively
DistilBERT | Aug 2019 | HuggingFace | Distill BERT into a smaller model
ALBERT | Sep 2019 | Google Research & Toyota Technological Institute at Chicago | Lighter BERT via parameter sharing and factorized embeddings
XLNet | Jun 2019 | Carnegie Mellon University & Google Brain | Permutation language model
The evaluation is conducted in two steps.
In the first experimental setup, all models are trained for 2 epochs with a learning rate of 3e-5, a sequence length of 512 tokens, a batch size of 8 and a linear learning rate schedule. With this fixed set of hyperparameters, three runs were conducted per model and dataset. The first run freezes all layers except for the last two (pooling & classification layer). The second run finetunes all layers. The third run freezes all embedding layers.
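As an illustration, here is a minimal sketch of how the three freezing strategies could look with the HuggingFace transformers library, using BERT as an example; the actual implementation in experiments.py may differ, and the other models expose similar (but not identical) submodules.

```python
from transformers import BertForSequenceClassification

# Hypothetical illustration of the three freezing strategies for BERT.
model = BertForSequenceClassification.from_pretrained("bert-base-cased", num_labels=4)

def freeze_all_but_last_two(model):
    # "freeze": train only the pooling and classification layers
    for param in model.bert.parameters():
        param.requires_grad = False
    for param in model.bert.pooler.parameters():
        param.requires_grad = True
    # model.classifier sits outside model.bert and therefore stays trainable

def freeze_embeddings(model):
    # "freeze_embed": freeze the embedding layers only
    for param in model.bert.embeddings.parameters():
        param.requires_grad = False

# "no_freeze": finetune all layers, i.e. leave requires_grad untouched
```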
The second step consists of an extensive grid search over the hyperparameters learning rate, batch size, sequence length and learning rate schedule and covers the following grid:
Hyperparameter | Values |
---|---|
Sequence length | 256 or 512 |
Batch size | 16, 32 (with sequence length 256); 4, 8 (with sequence length 512) |
Learning rate | 1e-5, 2e-5, 3e-5, 4e-5 |
Learning rate schedule | constant, linear, cosine |
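Assuming the batch sizes are paired with the sequence lengths as shown above (which matches the 48 combinations reported for grid_search.py below), the size of the grid can be verified with a short snippet:

```python
from itertools import product

# Batch sizes paired with sequence lengths as in the table above (assumption
# based on the 48 combinations reported for grid_search.py).
seq_len_batch_pairs = [(256, 16), (256, 32), (512, 4), (512, 8)]
learning_rates = [1e-5, 2e-5, 3e-5, 4e-5]
schedules = ["constant", "linear", "cosine"]

combinations = list(product(seq_len_batch_pairs, learning_rates, schedules))
print(len(combinations))  # 4 * 4 * 3 = 48
```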
The main findings are:
- RoBERTa performs best
- The encoder-based approach of RoBERTa beats the autoregressive approach of XLNet
- The learning rate is the most important hyperparameter
There are three main scripts:
- data_prep
- experiments
- grid_search
All three scripts are used via the command line.
To execute everything, first create a virtual environment and then install the necessary packages via pip3 install -r requirements.txt.
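For example, on a Unix-like system:

```bash
python3 -m venv venv
source venv/bin/activate
pip3 install -r requirements.txt
```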
Executing python3 data_prep.py takes the files
- train_bodies.csv
- train_stances.csv
- competition_test_bodies.csv
- competition_test_stances.csv
for the FNC-1 and FNC-1 ARC datasets and fully processes them.
The processed files can be found under data/processed.
For both datasets three files are created for training (train), evaluation (dev) and testing (test) respectively.
The main pre-processing steps are
- assign integer values 0,1,2,3 to the four classes AGR, DSG, DSC, UNR
- merge headline and article body
- remove stop words The, the, A, a, An, an by using the word tokenizer of NLTK
- create split into training and development by using the 80:20 split function of the FNC-1
The folder data/splits contains the ids for the training and evaluation (hold_out) instances.
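A minimal sketch of the pre-processing steps above is given below; it is a simplified stand-in for data_prep.py and assumes the stance labels appear in lowercase as in the original FNC-1 files.

```python
from nltk import download
from nltk.tokenize import word_tokenize

download("punkt")  # tokenizer models required by word_tokenize

# Label mapping as described above: AGR, DSG, DSC, UNR -> 0, 1, 2, 3
LABELS = {"agree": 0, "disagree": 1, "discuss": 2, "unrelated": 3}
STOP_WORDS = {"The", "the", "A", "a", "An", "an"}

def preprocess(headline, body, stance):
    """Merge headline and article body, drop the listed stop words, map the label."""
    text = headline + " " + body
    tokens = [tok for tok in word_tokenize(text) if tok not in STOP_WORDS]
    return " ".join(tokens), LABELS[stance]
```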
Executing python3 experiments.py evaluates one of the three freezing techniques, depending on how the --freeze flag is set. All models are trained for two epochs only, and the evaluation is done on the evaluation (dev) dataset.
Most important flags:
The --model flag defines whether to use bert, roberta, distilbert, albert or xlnet
The --model_type flag takes the specific pretrained model from HuggingFace, for example bert-base-cased for bert
The --num_epochs flag is set to a default value of 2 epochs and should not be changed
The --dataset_name flag can be used to switch between the FNC-1 and FNC-1 ARC dataset
The --freeze flag sets the freezing technique to be used: freezing all but the finetuned layers (freeze), freezing only the embedding layers (freeze_embed), or freezing nothing, i.e. finetuning all layers (no_freeze)
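A typical call could look as follows; the exact values accepted by --dataset_name and the other flags are defined in experiments.py, so the ones shown here are illustrative only.

```bash
python3 experiments.py --model bert --model_type bert-base-cased --dataset_name FNC-1 --freeze no_freeze
```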
Since the experiment script has to be run several times for each model and dataset, an additional bash script facilitates this; it can be run via ./experiments.sh in the terminal.
Executing python3 grid_search.py conducts the grid search over the 48 hyperparameter combinations.
It uses the tune package.
Important: the current learning rate has to be set manually within the script in the search_space dictionary. The storage capacity of the virtual machine only allowed for saving 12 model combinations at a time, so for each model and dataset, grid_search.py had to be run four times, once per learning rate.
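For illustration, the search space could look roughly like the following, assuming that tune refers to Ray Tune and that, as described above, only one learning rate is active per run; the actual search_space in grid_search.py may be structured differently.

```python
from ray import tune

# Hypothetical search space; keys and structure are illustrative only.
search_space = {
    "learning_rate": tune.grid_search([1e-5]),  # 2e-5, 3e-5, 4e-5 in separate runs
    "seq_len_and_batch_size": tune.grid_search([(256, 16), (256, 32), (512, 4), (512, 8)]),
    "lr_schedule": tune.grid_search(["constant", "linear", "cosine"]),
}

# The search space is then passed to the training function, e.g.
# tune.run(train_model, config=search_space)
```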
See Details on Initial Experiments script for details on the flags that can be set.
The difference between the experiments and the grid search scripts is that the latter relies on the use of tune to speed up training and to perform grid search.
In some cases, a grid_search run did not terminate; in those cases, the evaluation and testing steps were performed separately afterwards.