This project was done as part of the Computational Intelligence Lab 2020 at ETH Zurich (see the course website). Specifically, we perform sentiment analysis on tweets and classify them into positive and negative sentiments (see the Kaggle competition). Our team name was FrontRowCrew, and we were eventually ranked first out of 35 teams on the private test set.
The experiments were run and tested with Python version 3.7.1.
Download data from: http://www.da.inf.ethz.ch/teaching/2018/CIL/material/exercise/twitter-datasets.zip
Clone the project:
git clone https://github.com/ferraric/Computational-Intelligence-Lab-2020.git
Install the project:
cd Computational-Intelligence-Lab-2020
Move the data into the data folder:
mkdir data
mv path-to-downloaded-folder/twitter-datasets/* data
Before running, make sure that the source directory is recognized by your PYTHONPATH, for example:
export PYTHONPATH=/path_to_source_directory/Computational-Intelligence-Lab-2020:$PYTHONPATH
If on Leonhard: install your Python virtual environment into Computational-Intelligence-Lab-2020/venv:
python3 -m venv ~/Computational-Intelligence-Lab-2020/venv
source ./init_leonhard.sh
If local:
pip install -r requirements.txt
Note that the PyTorch version we used (pre-compiled with a specific CUDA version) is not available for macOS. If you want to run it locally on a Mac, change the PyTorch version in the requirements file to the following:
torch==1.5.0
To log the experiment results, we used Comet (https://www.comet.ml/docs/), a TensorBoard-like logger. Unfortunately, we cannot make access to our experiment logs public; if access to the logs is needed, contact jeremyalain.scheurer@gmail.com. If you would like to use your own Comet account to run the experiments, fill in the Comet-related config options with your account credentials.
All experiments can be run with the config option "use_comet_experiments": false, which is the default. In that case, the logs and saved predictions are found in the same directory where the model checkpoint is saved. That path is the concatenation of the config options "model_save_directory" and "experiment_name" and the timestamp of the execution start time.
Example: experiments/bert-baseline/20-07-25_12-25-02/
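As an illustration only, here is how this output path is assembled; the config keys are the ones described above, while the exact timestamp format is our assumption based on the example:

import os
from datetime import datetime

# Hypothetical config values; the keys match the options described above.
config = {
    "model_save_directory": "experiments",
    "experiment_name": "bert-baseline",
}

# Timestamp format assumed from the example directory name above.
timestamp = datetime.now().strftime("%y-%m-%d_%H-%M-%S")
output_directory = os.path.join(config["model_save_directory"], config["experiment_name"], timestamp)
print(output_directory)  # e.g. experiments/bert-baseline/20-07-25_12-25-02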
To calculate the accuracy of a prediction file, run the following command:
python utilities/calculate_accuracy.py -p path-to-predictions.csv -l path-to-labels.csv
Note that the predictions should be formatted as specified in the sample_submission.csv file.
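For intuition, a minimal sketch of such an accuracy computation, assuming both files follow the sample_submission.csv format with Id and Prediction columns and labels in {-1, 1} (the actual utilities/calculate_accuracy.py may differ in details):

import pandas as pd

def accuracy(predictions_path: str, labels_path: str) -> float:
    # Assumes columns "Id" and "Prediction" with values in {-1, 1}.
    predictions = pd.read_csv(predictions_path).sort_values("Id")
    labels = pd.read_csv(labels_path).sort_values("Id")
    return float((predictions["Prediction"].values == labels["Prediction"].values).mean())

print(accuracy("path-to-predictions.csv", "path-to-labels.csv"))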
The following holds for all models except the Google Natural Language API and GloVe:
The hyperparameters epochs, max_tokens_per_tweet, validation_size, validation_split_random_seed, batch_size, n_data_loader_workers and learning_rate were unchanged for all runs with a particular model. How the other hyperparameters were varied is described in the following sections.
When running an experiment, the provided test data are automatically predicted at the end of training with the best saved checkpoint. To predict a set of tweets from an existing checkpoint, point the config option "test_tweets_path" to the corresponding tweets and provide the model checkpoint via the argument -t. Example:
python mains/bert.py -c configs/bert.json -t path-to-model-checkpoint/model_checkpoint.ckpt
Note that one needs a Google Cloud account and credits to use this service. Make sure you set the environment variable GOOGLE_APPLICATION_CREDENTIALS to point to the JSON file containing your account credentials. As a disclaimer, the Google Natural Language API is a paid service. One usually has a certain number of free API calls (which we used), but make sure to first check what your free API call limits are.
export GOOGLE_APPLICATION_CREDENTIALS=path-to-your-account-credentials.json
python baselines/google_nlp_api.py -c configs/google_nlp_api.json
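For illustration, a minimal sketch of the underlying API call with a recent google-cloud-language client (this is not the project's baselines/google_nlp_api.py; mapping the returned score to the {-1, 1} labels is our assumption):

from google.cloud import language_v1

# Requires GOOGLE_APPLICATION_CREDENTIALS to be set as described above.
client = language_v1.LanguageServiceClient()

def predict_sentiment(tweet: str) -> int:
    document = language_v1.Document(content=tweet, type_=language_v1.Document.Type.PLAIN_TEXT)
    sentiment = client.analyze_sentiment(request={"document": document}).document_sentiment
    # The API returns a score in [-1.0, 1.0]; map its sign to the competition labels.
    return 1 if sentiment.score >= 0 else -1

print(predict_sentiment("what a great day"))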
To run the experiments for GloVe, download the code from https://github.com/dalab/lecture_cil_public/tree/master/exercises/2019/ex6 to generate the vocabulary and train the word embeddings. The README on that page gives instructions for building the vocabulary and the co-occurrence matrix used for training. Note that modifications have to be made to build_vocab.sh and cooc.py to use the full dataset.
After building the co-occurrence matrix, train the word embeddings by executing:
python glove_solution.py
In that script, the number of epochs can be specified.
After this is done, move both the generated vocab.pkl and embeddings.npz files to the data folder defined in the Setup section.
To train a classifier using the GloVe embeddings, run:
python baselines/main_glove.py -c configs/glove_embeddings_{logregression, decisiontree, randomforest}_classifier.json
The grid search parameters can be modified inside the respective config files.
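As an illustration of what such a classifier does (this is not the actual baselines/main_glove.py; the training file names, the key inside embeddings.npz and the grid values are assumptions), the sketch below averages the GloVe vectors of each tweet and fits a logistic regression with a small grid search:

import pickle
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Load the vocabulary and embeddings produced by the GloVe training step above.
with open("data/vocab.pkl", "rb") as f:
    vocab = pickle.load(f)  # word -> index
embeddings = np.load("data/embeddings.npz")["arr_0"]  # key assumed; shape (vocab_size, dim)

def embed(tweet: str) -> np.ndarray:
    # Represent a tweet as the mean of the vectors of its known words.
    indices = [vocab[word] for word in tweet.split() if word in vocab]
    return embeddings[indices].mean(axis=0) if indices else np.zeros(embeddings.shape[1])

def load_tweets(path: str) -> list:
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

# File names assumed from the downloaded twitter-datasets folder.
pos, neg = load_tweets("data/train_pos.txt"), load_tweets("data/train_neg.txt")
X = np.stack([embed(tweet) for tweet in pos + neg])
y = np.array([1] * len(pos) + [-1] * len(neg))

# Small grid search over the regularization strength, analogous to the config files.
grid = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1.0]}, cv=3)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)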
python mains/bert.py -c configs/bert.json
The scores provided in the report were the average across runs with 5 different random seeds. The random seeds we used were [0, 1, 2, 3, 4]; they were set via the config option random_seed.
For the ablation study, we ran 3 models:
python mains/roberta.py -c configs/roberta.json
For this run, the option use_special_tokens should be set to false. You can then execute:
python mains/bertweet.py -c configs/bertweet.json
Set the option use_special_tokens to true again, then do:
python mains/bertweet.py -c configs/bertweet.json
All runs were repeated with random_seed in [0, 1, 2, 3, 4].
Download and extract the folder from http://cs.stanford.edu/people/alecmgo/trainingandtestdata.zip. Run the following preprocessing script:
python data_processing/preprocess_additional_data.py -i path-to-downloaded-folder/training.1600000.processed.noemoticon.csv -o output_folder
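For reference, a minimal sketch of what such a preprocessing step could look like (this is not the actual data_processing/preprocess_additional_data.py; the output file names are hypothetical). The downloaded Sentiment140 CSV is latin-1 encoded and stores the polarity in the first column (0 = negative, 4 = positive) and the tweet text in the last column:

import csv
import os

input_path = "path-to-downloaded-folder/training.1600000.processed.noemoticon.csv"
os.makedirs("output_folder", exist_ok=True)

# Sentiment140 columns: polarity, id, date, query, user, text.
with open(input_path, encoding="latin-1") as infile, \
     open("output_folder/additional_positive_tweets.txt", "w", encoding="utf-8") as pos, \
     open("output_folder/additional_negative_tweets.txt", "w", encoding="utf-8") as neg:
    for polarity, _id, _date, _query, _user, text in csv.reader(infile):
        (pos if polarity == "4" else neg).write(text + "\n")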
To run a model, use the same command as described above and set the config option use_additional_data to true. Also set additional_positive_tweets_path and additional_negative_tweets_path to the respective files generated in the output folder by the preprocessing script.
We used the 5 runs from the BERTweet section and gathered all class output probabilities (logged during prediction of the test set). Place the runs to ensemble inside a directory "input_directory" (we sequentially averaged rs0, then rs0+rs1, then rs0+rs1+rs2, ...). Run:
python ensemble/ensemble_probabilities.py -i input_directory -o output_directory
Inside "output_directory" a file "ensemble_predictions.csv" will be generated.
For bagging, one needs to train multiple models with the option "do_bootstrap_sampling" set to true. Then proceed as described in the simple model averaging section.
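For intuition, bootstrap sampling simply draws the training set with replacement before each bagged model is trained; a minimal sketch:

import numpy as np

def bootstrap_sample(tweets, labels, seed=0):
    # Sample with replacement, keeping the original dataset size,
    # so every bagged model sees a slightly different training set.
    rng = np.random.default_rng(seed)
    indices = rng.integers(0, len(tweets), size=len(tweets))
    return [tweets[i] for i in indices], [labels[i] for i in indices]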
Repeat the following procedure for BERT and BERTweet:
1. Train the model as described in the corresponding model's section.
2. Train the model on tweets with the unmatched parentheses relevant to the Parentheses Rule removed. This is done by setting the config option remove_rule_patterns to true while keeping all other options the same. Each model will save the validation data and labels in the model checkpoint directory (see the beginning of the Reproduce Results section).
3. Evaluate the saved model from step 1, first on the validation data saved in step 1 and then on the validation data saved in step 2.
4. Evaluate the saved model from step 2, first on the validation data saved in step 1 and then on the validation data saved in step 2.

This will leave you with 4 prediction files. For each of these, perform step 5:

5. Run
python rules/main.py -d "path_to_validation_tweets_saved_in_step_1" -l "path_to_validation_labels_saved_in_step_1" -p "path_to_validation_predictions"
This will output confusion matrices and accuracies for both the Parentheses Rule and the given validation predictions.
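For intuition, a minimal sketch of how a confusion matrix and accuracy can be computed from such a saved label/prediction pair (this is not the actual rules/main.py, which additionally evaluates the Parentheses Rule itself; the one-value-per-line file format in {-1, 1} is an assumption):

import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

# Assumed format: one label/prediction per line, aligned with the saved validation tweets.
true_labels = pd.read_csv("path_to_validation_labels_saved_in_step_1", header=None)[0]
predictions = pd.read_csv("path_to_validation_predictions", header=None)[0]

print(confusion_matrix(true_labels, predictions, labels=[-1, 1]))
print("accuracy:", accuracy_score(true_labels, predictions))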
All experiments with BERT, RoBERTa and BERTweet were run on ETH's Leonhard cluster using an Nvidia GeForce RTX 2080 Ti GPU. The runtimes per model were about 16 hours (26 hours with additional data), using 2 CPU cores and 64 GB of memory for BERT and BERTweet and 96 GB of memory for RoBERTa.
We would like to give credit to the following tools, libraries and frameworks that helped us during this project:
- We used the transformers library from Hugging Face (https://github.com/huggingface/transformers) for all of our transformer-based models.
- To structure our code and abstract away some PyTorch details, we used the PyTorch Lightning framework (https://github.com/PyTorchLightning/pytorch-lightning).
- For continuous integration we used Travis CI (https://travis-ci.com/).
- For formatting, enforcing coding style and a weak form of type checking we used various pre-commit hooks (https://pre-commit.com).