This is the second project of the EPFL Machine Learning course, Fall 2019. In this project, we are given a dataset of 2.5 million tweets, half labeled with positive sentiment and half with negative sentiment. Our task is to predict the sentiment of the 10,000 unlabeled tweets in the test set.
The project is implemented in Python 3. You will need the following dependencies installed:
- `$ pip install nltk` (see the note after this list)
- `$ pip install gensim`
- `$ pip install fasttext`
- `$ pip install torchtext`
- `$ pip install transformers`
- `$ pip install tqdm`
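Depending on the preprocessing done in the notebooks, NLTK may also ask you to download some of its data packages on first use. Which resources are needed is an assumption here; stopwords and the punkt tokenizer are common choices:

```python
import nltk

# Assumed resources -- adjust to whatever the notebooks actually use.
nltk.download("stopwords")
nltk.download("punkt")
```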
- `tfidf_word2vec/tf_idf.ipynb`: Training and testing procedure for simple ML models using the TF-IDF matrix (see the sketch after this list).
- `tfidf_word2vec/word2vec.ipynb`: Training and testing procedure for simple ML models using the Word2Vec matrix.
- `tfidf_word2vec/helpers_simple_ml.py`: Helper functions used in `tf_idf.ipynb` and `word2vec.ipynb`.
- `bagging.ipynb`: Simple voting (can be used after training and testing in `bert_based.ipynb`).
- `bert_based.ipynb`: Training and testing procedures for BERT-based models.
- `fasttext.ipynb`: Training and testing procedures for the fastText-based model.
- `helpers.py`: Useful helper functions for the fastText model.
- `run.py`: Code to reproduce our result.
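As an illustration of the simple-ML pipeline in `tfidf_word2vec/tf_idf.ipynb`, here is a minimal sketch of building a TF-IDF matrix and fitting a linear classifier on top of it. It assumes scikit-learn (not listed in the dependencies above) and uses tiny in-memory examples; the actual preprocessing, data loading, and models live in the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical in-memory data; the notebook loads the full tweet files instead.
train_tweets = ["i love this :)", "this is awful :("]
train_labels = [1, -1]

# Build the TF-IDF matrix over word unigrams and bigrams.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=1)
X_train = vectorizer.fit_transform(train_tweets)

# Fit a simple linear classifier on the TF-IDF features.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train_labels)

# Predict sentiment (+1 / -1) for unseen tweets.
X_test = vectorizer.transform(["what a great day"])
print(clf.predict(X_test))
```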
- AIcrowd competition link: https://www.aicrowd.com/challenges/epfl-ml-text-classification-01b777b0-a83a-412a-b6f8-f3dc53cb1bce
- Group name: TWN1
- Leaderboard:
  - Categorical accuracy: 0.909
  - F1 score: 0.909
There are two ways to reproduce our result:
- Use the trained models to generate predictions, then vote over them to obtain our best prediction. This takes about 2.5 hours on a CPU; with a GPU it is faster.
- Vote over the saved predictions to obtain our best prediction. This takes only a few seconds (a minimal voting sketch is shown after this list).
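The voting step amounts to a majority vote over several prediction files. Below is a minimal sketch, assuming the AIcrowd submission format (columns `Id` and `Prediction` with values +1/-1) and hypothetical file names; `run.py` handles this with the actual prediction files.

```python
import pandas as pd

# Hypothetical prediction files; the real ones come from the trained models.
files = ["pred_bert_1.csv", "pred_bert_2.csv", "pred_fasttext.csv"]

# Read each submission, indexed by tweet Id.
preds = [pd.read_csv(f).set_index("Id")["Prediction"] for f in files]

# Majority vote: sum the +1/-1 labels and keep the sign
# (an odd number of models avoids ties).
votes = sum(preds)
majority = votes.apply(lambda v: 1 if v > 0 else -1).rename("Prediction")

majority.to_csv("best_prediction.csv", header=True)
```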
Here are the steps:
- If you choose the second method, skip steps 2 and 3.
- Download the trained models through this Google Drive link and put them in a folder called `models`.
- Change the `test_data_dir` argument (the directory of the test data).
- Execute one of the following commands:
  - For the first method: `$ python3 run.py --test_model --test_data_dir 'data/test_data.txt'`
  - For the second method: `$ python3 run.py --test_predictions`
- The prediction will be saved as `best_prediction.csv`.
@Kuan Tung @Chun-Hung Yeh @De-Ling Liu
For BERT-based models:
- pytorch-sentiment-analysis Tutorial 6
- Class for transforming a pandas DataFrame into a torchtext Dataset
- Transformers documentation
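For reference, here is a minimal sketch of loading a pretrained BERT classifier with the `transformers` library. The checkpoint name, the example tweets, and the omission of any fine-tuning loop are assumptions, and the sketch targets a recent transformers release (callable tokenizer, `.logits` on the model output); it is not the exact code in `bert_based.ipynb`.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Assumed checkpoint; the notebook may use a different pretrained model.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()

# Tokenize a small batch of tweets and run a forward pass (no fine-tuning shown here).
batch = tokenizer(["i love this :)", "this is awful :("],
                  padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits

print(logits.argmax(dim=-1))  # predicted class indices (0 or 1)
```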
Licensed under the MIT License.