Machine Learning for Binary Text Classification

CS-433 Project 2, 2021, EPFL Text classification

Author:

Fridtjof Storm Flaate | fridtjof.flaate@epfl.ch

Submission:

#169479

Abstract:

The goal of this project was to build a model that accurately classifies tweets as either positive or negative. The repository contains six models: three classic machine learning models and three neural networks. The best-performing model is a neural network built on pre-trained Bidirectional Encoder Representations from Transformers (BERT). This transfer-learning model reached an accuracy of 89.3% and an F1 score of 89.6%.
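For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how accuracy and F1 are computed for binary labels. The function name `accuracy_and_f1`, the toy labels, and the 1 / -1 label encoding are assumptions for illustration only and are not taken from the project code.

```python
def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) for binary labels; `positive` marks the positive class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0  # F1 = 2TP / (2TP + FP + FN)
    return accuracy, f1


# Toy labels (assumed encoding: 1 = positive tweet, -1 = negative tweet)
y_true = [1, 1, -1, -1, 1]
y_pred = [1, -1, -1, 1, 1]
acc, f1 = accuracy_and_f1(y_true, y_pred)
print(f"accuracy={acc:.3f}, F1={f1:.3f}")
```

F1 is the harmonic mean of precision and recall, so it penalizes a model that does well on only one of the two.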

Setup:

This is a step-by-step guide to setting up your environment to run run.py, which creates the submission file.

Prerequisites

conda
pip3
python3
Download 'epfml-text' from here, unzip it, and add it to the /twitter-datasets folder.

Installation

Clone the repo and enter directory text_classification

git clone https://github.com/StormFlaate/text_classification && cd text_classification

Create the environment
```
conda create --name text_classification
```
Activate the environment
```
conda activate text_classification
```

Install the conda dependencies

conda install --file requirements.txt && conda install -c huggingface transformers

Install the pip dependencies
```
pip3 install -r requirements_pip.txt
```
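After installation, a quick sanity check can confirm that the packages are importable from the active environment. The sketch below is an assumption-laden helper, not part of the repository: `missing_modules` is a hypothetical name, and the list of required modules contains only `transformers` because that is the only package this guide names explicitly.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumption: extend this list with the packages that requirements.txt and
# requirements_pip.txt actually contain; only "transformers" is named in this guide.
required = ["transformers"]
missing = missing_modules(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All listed packages are importable.")
```

Run it with the `text_classification` environment activated, otherwise it checks the wrong interpreter.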

Overview:

Setup files:

requirements.txt: file to install conda dependencies
requirements_pip.txt: file to install python3 dependencies
README.md: file containing information about the project

Machine learning models:

SGD log loss: training and testing of the model
Logistic regression: training and testing of the model
Random forest: training and testing of the model
NN GloVe: Neural network with aggregated GloVe word embeddings
NN sentence transformer: Neural network with all-MiniLM-L6-v2 sentence embeddings
NN transfer learning BERT: Transfer-learning model based on BERT
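To illustrate what "aggregated" word embeddings can mean for the GloVe model, here is a minimal sketch that averages the vectors of the known words in a tweet into one fixed-length feature vector. Averaging is an assumption; the project may aggregate differently. The function `average_embedding` and the 3-dimensional toy vectors are hypothetical and stand in for real GloVe vectors, which are much larger.

```python
def average_embedding(tokens, embeddings, dim):
    """Mean of the vectors of known tokens; a zero vector if no token is known."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return [0.0] * dim
    # zip(*vectors) groups the i-th component of every vector together
    return [sum(component) / len(vectors) for component in zip(*vectors)]


# Hypothetical toy vectors, not real GloVe values
toy_embeddings = {
    "good": [0.5, 0.1, 0.0],
    "day": [0.1, 0.3, 0.2],
}
tweet_vector = average_embedding("what a good day".split(), toy_embeddings, dim=3)
print(tweet_vector)
```

Words missing from the vocabulary ("what" and "a" in this toy case) are skipped, so every tweet maps to a vector of the same length regardless of how many words it has.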

Run files:

run.py: file containing everything needed to recreate the best submission

Helper functions:

helper functions: contains all helper functions and classes used in the project

Folders:

twitter-datasets: will contain all datasets used for this project; they need to be downloaded manually.
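Because the data has to be downloaded manually, it can help to check the folder before starting a long run. The sketch below is a hypothetical helper (`list_dataset_files` is not part of the repository) and assumes it is run from the repository root. It does not assume any particular file names inside the archive.

```python
from pathlib import Path


def list_dataset_files(folder="twitter-datasets"):
    """Return the sorted file names in `folder`, or raise if it is missing or empty."""
    path = Path(folder)
    if not path.is_dir():
        raise FileNotFoundError(f"{folder} not found; download and unzip the data first")
    files = sorted(p.name for p in path.iterdir() if p.is_file())
    if not files:
        raise FileNotFoundError(f"{folder} is empty; unzip the downloaded data into it")
    return files


try:
    print(list_dataset_files())
except FileNotFoundError as err:
    print(err)
```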