Coding Assignment for BH RA Application - NLP for tweet sentiment classification. We build a simple pipeline to process tweets and create binary classification models.
A `requirements.txt` is available to set up a clean environment. After that, running `main.sh` should recreate the results.
The repo is structured as follows:
- `01_Data/`: This directory contains the dataset files used for training and testing the models.
  - `Raw`: Contains the raw datasets as downloaded from Kaggle
  - `Processed`: Contains processed training, validation and testing datasets
- `02_Code/`: This directory contains the source code for the tweet processing pipeline and classification models.
  - `data_prep.py`: Contains the pipeline to clean the dataset, generate features and create training, validation and testing sets
  - `modeling.py`: Contains the pipeline to fit and evaluate models
  - `main.sh`: Orchestrator that runs the rest of the scripts
  - `generate_report_figures.py`: Small script that generates graphs from the results
- `03_Results/`: This folder stores the final report, as well as the figures and tables that feed it.
  - `Report.md`: Final report on the project
  - Other figures and CSV files
- `04_Notebooks`: Empty; it was used to store scratchpad notebooks.
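
For readers who want a feel for the kind of steps the data preparation script is described as performing (cleaning tweets and creating training, validation and testing sets), here is a minimal, standard-library-only sketch. It is not the code in `data_prep.py`: the function names `clean_tweet` and `split_dataset`, the cleaning rules, and the 80/10/10 split are all assumptions made for illustration.

```python
import random
import re

# Hypothetical cleaning rules; the actual pipeline may use different ones.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
NON_ALNUM_RE = re.compile(r"[^a-z0-9\s]")


def clean_tweet(text: str) -> str:
    """Lowercase a tweet and strip URLs, @mentions and punctuation.

    Hashtag words are kept; only the leading '#' is dropped.
    """
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = NON_ALNUM_RE.sub(" ", text)
    # Collapse the runs of whitespace left behind by the substitutions.
    return " ".join(text.split())


def split_dataset(rows, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle rows reproducibly and cut them into train/validation/test lists.

    The fractions and the seed are illustrative defaults, not values
    taken from the repository.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_val = int(len(rows) * val_frac)
    n_test = int(len(rows) * test_frac)
    val = rows[:n_val]
    test = rows[n_val:n_val + n_test]
    train = rows[n_val + n_test:]
    return train, val, test
```

With these assumed rules, `clean_tweet("Loving this!! https://t.co/abc @user #NLP")` yields `"loving this nlp"`, and `split_dataset(range(10))` returns lists of 8, 1 and 1 items. Fixing the seed keeps the split reproducible across runs, which matters if the results are meant to be recreated from a single script.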