Smiley prediction on Twitter data :)

In this paper, we apply machine learning methods to Twitter data to predict if a message has a positive or a negative smiley.

We present four different types of models: a set of simple machine learning baseline models; two long short-term memory (LSTM) models using word2vec and GloVe embeddings, respectively; transformer models; and a few-shot learning model using TARS.

Our proposed model uses the CT-BERT language model, which achieves 0.906 accuracy and a 0.905 F1-score on the test set. It placed third in the respective AIcrowd competition (submission ID: 107963).

For a step-by-step guide to running all the experiments in the project, please take a look at this notebook:

We strongly advise running the project with the above Colab notebook, which offers free GPUs.

Step-by-step guide for local deployment

Getting started


Clone and enter the repository

git clone https://<YOUR USER>:<YOUR PASSWORD>@github.com/CS-433/cs-433-project-2-mlakes MLProject2
cd MLProject2

We recommend installing the dependencies inside a Python virtual environment to avoid conflicts with other packages installed on the machine. You can use virtualenv, pyenv, or conda to do that.

pyenv virtualenv mlproject2
pyenv activate mlproject2

Project dependencies are located in the requirements.txt file.
To install them you should run:

pip install -r requirements.txt

To install the spaCy dependencies, please run the following:

python -m spacy download en_core_web_sm


The raw data can be downloaded from the webpage of the AIcrowd challenge.
The data should be located in the data/ directory in csv format.

To do this, move the zip file to the data/ directory and run:

unzip data/twitter-datasets.zip -d data/

mv data/twitter-datasets/train_neg.txt data/train_neg.txt 
mv data/twitter-datasets/train_pos.txt data/train_pos.txt 
mv data/twitter-datasets/train_neg_full.txt data/train_neg_full.txt 
mv data/twitter-datasets/train_pos_full.txt data/train_pos_full.txt 
mv data/twitter-datasets/test_data.txt data/test_data.txt
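Once the training files are in place, they can be read into memory for a quick sanity check. The following is a minimal sketch, not the repository's own loading code: the function name `load_labeled_tweets` is hypothetical, and it assumes each training file is UTF-8 text with one tweet per line.

```python
from pathlib import Path


def load_labeled_tweets(pos_path, neg_path):
    """Return (texts, labels) with label 1 for positive and 0 for negative tweets."""
    texts, labels = [], []
    for path, label in ((pos_path, 1), (neg_path, 0)):
        # Assumption: each file is UTF-8 text with one tweet per line.
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                tweet = line.rstrip("\n")
                if tweet:  # skip blank lines
                    texts.append(tweet)
                    labels.append(label)
    return texts, labels


if __name__ == "__main__":
    data_dir = Path("data")
    pos, neg = data_dir / "train_pos.txt", data_dir / "train_neg.txt"
    if pos.exists() and neg.exists():
        texts, labels = load_labeled_tweets(pos, neg)
        print(f"{len(texts)} tweets, {sum(labels)} positive")
    else:
        print("Expected data/train_pos.txt and data/train_neg.txt; run the commands above first.")
```

Running it from the repository root prints the number of loaded tweets, which is a cheap way to confirm that the files were moved to the right place.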



The BiLSTM can be trained with GloVe and word2vec embeddings. In order to run these models, you need to create the vocabulary (word2vec) or download a pre-trained one (GloVe).


Construct a vocabulary list of words appearing at least 5 times:

python preprocessing_glove/pickle_vocab.py
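To illustrate what a frequency-thresholded vocabulary looks like, here is a small sketch. It is not the repository's pickle_vocab.py: the function `build_vocab` is hypothetical, and splitting on whitespace is an assumption about tokenization.

```python
from collections import Counter


def build_vocab(tweets, min_count=5):
    """Map each token seen at least `min_count` times to an integer index."""
    # Assumption: tokens are separated by whitespace.
    counts = Counter(token for tweet in tweets for token in tweet.split())
    # Sort by descending frequency, then alphabetically, so indices are deterministic.
    kept = sorted((t for t, c in counts.items() if c >= min_count),
                  key=lambda t: (-counts[t], t))
    return {token: index for index, token in enumerate(kept)}


if __name__ == "__main__":
    sample = ["good morning :)"] * 5 + ["bad day :("] * 2
    print(build_vocab(sample))  # {':)': 0, 'good': 1, 'morning': 2}
```

Tokens below the threshold ("bad", "day", ":(" in the sample) are dropped, which keeps rare misspellings and one-off hashtags out of the embedding table.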


Download the pretrained GloVe embeddings from here or with wget:

wget http://nlp.stanford.edu/data/glove.twitter.27B.zip
mv glove.twitter.27B.zip data/embeddings/glove.twitter.27B.zip
unzip data/embeddings/glove.twitter.27B.zip -d data/embeddings
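The extracted GloVe files are plain text, with one word per line followed by its vector components separated by spaces. The sketch below shows one way such a file could be parsed; the function `load_glove` is hypothetical and is not taken from the repository's embeddings.py.

```python
import io


def load_glove(handle, vocab=None):
    """Parse GloVe text lines ("word v1 v2 ...") into {word: [float, ...]}.

    If `vocab` is given, only words contained in it are kept.
    """
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        word, values = parts[0], parts[1:]
        if vocab is None or word in vocab:
            vectors[word] = [float(v) for v in values]
    return vectors


if __name__ == "__main__":
    # Tiny in-memory stand-in for one of the extracted embedding files.
    fake_file = io.StringIO("happy 0.1 0.2 0.3\nsad -0.1 -0.2 -0.3\n")
    print(load_glove(fake_file, vocab={"happy"}))  # {'happy': [0.1, 0.2, 0.3]}
```

Passing a vocabulary keeps only the vectors that are actually needed, which matters because the full Twitter embedding files are large.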

TARS zero shot

wget https://nlp.informatik.hu-berlin.de/resources/models/tars-base/tars-base.pt
mv tars-base.pt saved_models/tars-base.pt


To train the model, you can run:

cd src
python run.py --pipeline training 

To run a particular model, pass its name as a parameter:

cd src
python run.py --pipeline training \
              --model glove 

The following models can be trained:

  • tfidf : Term Frequency-Inverse Document Frequency
  • word2vec : BiLSTM using word2vec embeddings
  • glove : BiLSTM using GloVe embeddings
  • bert : Bidirectional Encoder Representations from Transformers (CT-BERT)
  • zero : Few-shot learning model (TARS)
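For readers curious how options like these are typically handled, the sketch below parses `--pipeline` and `--model` with argparse. It is a hypothetical illustration, not the actual src/run.py: only the flag names and model names come from this README, and the `bert` default is an assumption based on the note that bert is trained when no parameters are passed.

```python
import argparse

# Model names taken from the list above; everything else here is illustrative.
MODELS = ("tfidf", "word2vec", "glove", "bert", "zero")


def parse_args(argv=None):
    """Parse command-line options; `argv=None` falls back to sys.argv."""
    parser = argparse.ArgumentParser(description="Hypothetical CLI sketch")
    parser.add_argument("--pipeline", default=None,
                        help="pipeline stage to run, e.g. 'training'")
    # Assumption: bert is the default model.
    parser.add_argument("--model", choices=MODELS, default="bert",
                        help="which model to use")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args(["--pipeline", "training", "--model", "glove"])
    print(args.pipeline, args.model)  # training glove
```

Restricting `--model` with `choices` makes a mistyped model name fail immediately with a usage message instead of partway through a long training run.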

To create the predictions, you can run:

python src/run.py --testing

Complete pipeline

If no parameters are passed, the bert model is trained and predictions are then made on the test data.

python src/run.py 

Running with Docker

The project can be run on any virtual machine without installing any dependencies by using our Docker container.

  1. Make sure you have Docker and git installed and running.

  2. Declare the variables used by the command below. The image is available on Docker Hub as paolamedo/bert_notebook:latest, and BUILD_DIR is the location of the cloned repo.

REPO_URL=paolamedo/bert_notebook:latest
BUILD_DIR=/home/paola/Documents/EPFL/MLProject2

  3. Run Docker:

docker run --rm -it -e GRANT_SUDO=yes \
--user root \
-p 8888:8888 \
-e JUPYTER_TOKEN="easy" \
-v $BUILD_DIR:/home/jovyan/work $REPO_URL

  4. You will now be able to open Jupyter Notebook and run notebooks/MLProject2_GAP.ipynb, or run from the terminal:

python src/run.py 

Testing the code

To test the data transformation code, please run:

cd src
python test_preprocessing.py 
python test_data_cleaning.py 
python test_embeddings.py 

Project Architecture


Our paper describing the methodology and experiments of the proposed model is located under the report/ directory in PDF format.

Folder structure

The source code of this project is structured in the following manner.

├── README.md
├── requirements.txt
├── Dockerfile-notebook
├── docs/                      # report and project description
├── data/                      # the data directory
│   ├── embeddings/            # directory where embeddings will be stored
│   └── twitter-datasets.zip   # This is where the data should be loaded
├── models/                    # directory where models are saved
├── predictions/               # directory where the predictions are saved
├── notebooks
│   └── MLProject2_GAP.ipynb
├── src
│   ├── models/                # directory with models' code   
│   ├── preprocessing_glove/   # directory with files to preprocess corpus for glove
│   ├── data_cleaning.py
│   ├── data_loading.py
│   ├── embeddings.py
│   ├── evaluate.py
│   ├── model_selection.py
│   ├── preprocessing.py
│   └── run.py
└── test                       # unit tests
    ├── test_data_cleaning.py
    ├── test_embeddings.py
    └── test_preprocessing.py


  • Angeliki Romanou @agromanou
  • George Fotiadis @geofot96
  • Paola Mejia @paola-md

To see the development of the project and the interesting discussions we had in each pull request, you can visit our development repository: https://github.com/geofot96/MLProject2/