Sentiment analysis of text based on a dataset of Annotated Tweets. This code has been written as the sentiment module of the Speemo package
In a variation on the popular task of sentiment analysis, this dataset contains labels for the emotional content (such as happiness, sadness, and anger) of texts. 40 thousands of examples across 13 labels can be found here. The labels have been concentrated to 6 basic Ekman's emotions.
The model is composed of a Recurrent NN with 512 LSTM hidden units connected to 6 output units by a Linear Layer. Throughput the training, dropout ration of 0.5 was used on the LSTM units and Linear Layer.
Individual words were extracted from tweets using the Twitter Tokenizer from NLTK, later all numbers and hyperlinks were turned into tokens and the remaining words were lemmatized using WordNet lemmatizer. Each resulting word was converted to vectors using the GloVe embedding scheme.
Training was performed using an Adam SGD algorithm with early stopping. Best results achieved after 20 epochs.
The model is composed of an ensable of classification trees trained using a gradient boosting approach.
Individual words were extracted from tweets using the Twitter Tokenizer from NLTK, later all numbers and hyperlinks were turned into tokens and the remaining words were lemmatized using WordNet lemmatizer. Resulting phrases were converted ot vectors using the Bag of Words approach with tf-idf approach.
Training was performed the xgboost library with parameters found in model_xgboost.py.
- Python - 3.5 or above
- pyTorch - Machine Learning Toolkit
- torchtext - Data loaders and abstractions for text and NLP
- NLTK - The Natural Language Toolkit
- xgboost - XGboost library
GPL