This repo builds, trains and exposes a system that performs sentiment classification, i.e. identifying how positive a given sentence is. The system has been trained on the open-source Stanford Sentiment Treebank dataset.
The easiest way to use the FastAPI HTTP endpoint is to download and run the following Docker image:
docker pull tommasobonomo/sentiment-classifier:inference
docker run --net=host tommasobonomo/sentiment-classifier:inference
This way the weights and configurations for the two proposed models are included directly in the Docker image, which is ready to run.
An HTTP endpoint will be available at localhost:8000, with readable docs at localhost:8000/docs.
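The exact prediction route is not spelled out here, so rather than guess it, the following minimal sketch lists the routes the running service declares. It assumes the app keeps FastAPI's default schema location (`/openapi.json`); the function names `fetch_schema` and `list_routes` are illustrative and not part of this repository.

```python
import json
from urllib.request import urlopen


def fetch_schema(base_url="http://localhost:8000"):
    """Download the OpenAPI schema (assumes FastAPI's default /openapi.json path)."""
    with urlopen(f"{base_url}/openapi.json", timeout=10) as response:
        return json.load(response)


def list_routes(schema):
    """Return sorted (METHOD, path) pairs declared in an OpenAPI schema dict."""
    http_methods = {"get", "post", "put", "patch", "delete", "head", "options"}
    routes = []
    for path, operations in schema.get("paths", {}).items():
        for method in operations:
            # Path items can also hold non-method keys such as "parameters".
            if method.lower() in http_methods:
                routes.append((method.upper(), path))
    return sorted(routes)


# With the container running:
#     for method, path in list_routes(fetch_schema()):
#         print(method, path)
```

The request and response formats for each listed route can then be read from the interactive docs at localhost:8000/docs.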
To extract features from the raw text, I implemented a term frequency-inverse document frequency (TfIdf) vectorizer that builds a vocabulary from the corpus of training sentences and encodes each sentence against it. I then applied truncated SVD to reduce the dimensionality of the encoded sentences. This preprocessing combination is usually known as Latent Semantic Analysis (LSA).
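As a rough illustration of the LSA step, here is a minimal sketch using scikit-learn's `TfidfVectorizer` and `TruncatedSVD`. This is an assumption made for illustration: the repository's own implementation may differ, the corpus is a made-up toy example, and the mapping of `tfidf_config.max_ngram_range` and `tfidf_config.output_dims` onto `ngram_range` and `n_components` is a guess.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy corpus, made up for illustration (not taken from the dataset).
corpus = [
    "a charming and often affecting journey",
    "the plot is dull and the acting is worse",
    "an affecting and charming little film",
    "dull , lifeless and far too long",
]

lsa = make_pipeline(
    # Unigrams and bigrams, weighted by TF-IDF.
    TfidfVectorizer(ngram_range=(1, 2)),
    # Project the sparse TF-IDF vectors onto 2 dense dimensions.
    TruncatedSVD(n_components=2, random_state=0),
)

# One dense row per sentence, one column per SVD component.
features = lsa.fit_transform(corpus)
print(features.shape)
```

The resulting dense matrix is the kind of input a gradient-boosted classifier can consume directly.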
I then trained a standard XGBoost classifier on these features. Hyperparameter tuning increased the F1 score by around 0.05.
XGBoost hyperparameter tuning
This repository manages its configuration with the Hydra package, which makes it possible to run a hyperparameter sweep over a set of parameters. Below are the parameters and intervals that I optimized with the Ax Sweeper plugin for Hydra.
python -m scripts.fit_and_evaluate --multirun hydra/sweeper=ax \
'xgboost_config.n_estimators=int(interval(5, 100))' \
'xgboost_config.max_depth=int(interval(1, 10))' \
'tfidf_config.output_dims=int(interval(5, 100))' \
'tfidf_config.max_ngram_range=int(interval(1, 3))'
The best hyperparameters reported by the Bayesian optimization algorithm were:
{
'xgboost_config.n_estimators': 95,
'xgboost_config.max_depth': 7,
'tfidf_config.output_dims': 49,
'tfidf_config.max_ngram_range': 2
}
I also implemented a transformer-based solution that uses a pre-trained Transformer encoder (in this case DistilBERT) with a classification head that can classify the whole given sentence.
I fine-tuned this architecture on the Stanford Sentiment Treebank dataset, evaluating a few different hyperparameter choices on the dev split of the dataset.
The final metrics obtained on the test split are:
Model | F1-score | Accuracy | Precision | Recall |
---|---|---|---|---|
TfIdf + XGBoost | 0.6095 | 0.6033 | 0.5844 | 0.6370 |
DistilBERT | 0.8461 | 0.8541 | 0.8685 | 0.8248 |
where all metrics are computed for a binary classification scenario. DistilBERT clearly outperforms the TfIdf + XGBoost baseline, by roughly 0.24 in F1-score and 0.25 in accuracy.
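For reference, the four reported metrics relate to the binary confusion matrix as in the sketch below. The function names and the example counts are illustrative only and are not taken from this repository's evaluation code.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (assumes they are not both zero)."""
    return 2 * precision * recall / (precision + recall)


def binary_metrics(tp, fp, fn, tn):
    """Compute binary classification metrics from confusion-matrix counts.

    Assumes at least one predicted positive and one actual positive,
    so that no denominator is zero.
    """
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
        "accuracy": accuracy,
    }


# Made-up counts, purely for illustration.
print(binary_metrics(tp=6, fp=2, fn=4, tn=8))
```

Plugging the precision and recall from the table into `f1_score` reproduces the reported F1-scores to within rounding of the inputs.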
A brief data exploration notebook is provided in notebooks/eda.ipynb. It should be viewable as-is, but it can also be re-run if necessary.