/twitter-sentiment-analysis

Sentiment analysis of Twitter posts

Primary language: Python. License: GNU General Public License v3.0 (GPL-3.0).

Sentiment Analysis of Tweets

Sentiment analysis of text based on a dataset of annotated tweets. This code was written as the sentiment module of the Speemo package.

Dataset

In a variation on the popular task of sentiment analysis, this dataset contains labels for the emotional content of texts (such as happiness, sadness, and anger). 40 thousand examples across 13 labels can be found here. The labels have been condensed to Ekman's 6 basic emotions.

Model: RNN

The model is a recurrent neural network with 512 LSTM hidden units connected to 6 output units by a linear layer. Throughout training, a dropout ratio of 0.5 was applied to the LSTM units and the linear layer.
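
The architecture described above could be sketched in PyTorch roughly as follows. This is a minimal illustration and not the repository's code: the class name `SentimentRNN`, the embedding size of 100, the single LSTM layer, and the exact placement of dropout (on the LSTM output before the linear layer) are all assumptions.

```python
import torch
import torch.nn as nn


class SentimentRNN(nn.Module):
    """LSTM encoder followed by a linear layer over 6 emotion classes."""

    def __init__(self, embedding_dim=100, hidden_size=512, num_classes=6, dropout=0.5):
        super().__init__()
        # embedding_dim is an assumption; it must match the GloVe vectors in use.
        self.lstm = nn.LSTM(embedding_dim, hidden_size, batch_first=True)
        # Assumed placement: dropout on the LSTM output that feeds the linear layer.
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(hidden_size, num_classes)

    def forward(self, embedded):
        # embedded: (batch, seq_len, embedding_dim) tensor of word vectors
        _, (h_n, _) = self.lstm(embedded)
        # h_n[-1] is the final hidden state of the last layer: (batch, hidden_size)
        return self.linear(self.dropout(h_n[-1]))


model = SentimentRNN()
model.eval()  # disable dropout for a deterministic forward pass
logits = model(torch.zeros(2, 7, 100))  # 2 tweets, 7 tokens each
```

Here `logits` holds one row of 6 unnormalized class scores per tweet; a softmax or cross-entropy loss would be applied on top of it.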

Pre-processing

Individual words were extracted from tweets using the Twitter tokenizer from NLTK. All numbers and hyperlinks were then replaced with special tokens, and the remaining words were lemmatized using the WordNet lemmatizer. Each resulting word was converted to a vector using GloVe embeddings.
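
The replacement of numbers and hyperlinks can be illustrated with a standard-library sketch. It uses a plain whitespace split as a stand-in for NLTK's tweet tokenizer and leaves out lemmatization, so punctuation stays attached to words. The placeholder names `<url>` and `<number>`, the regular expressions, and the function name `normalize_tweet` are assumptions, not taken from the repository.

```python
import re

# Assumed patterns; the repository may detect links and numbers differently.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")


def normalize_tweet(text):
    """Replace links and numbers with placeholder tokens, then split on whitespace."""
    # Links are replaced first so that digits inside a URL are not touched.
    text = URL_RE.sub(" <url> ", text)
    text = NUMBER_RE.sub(" <number> ", text)
    return text.split()


tokens = normalize_tweet("Won 3 tickets! http://example.com/win")
# tokens == ["Won", "<number>", "tickets!", "<url>"]
```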

Training

Training was performed using the Adam optimizer with early stopping. The best results were achieved after 20 epochs.
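
As a rough illustration of early stopping, the sketch below tracks the best validation loss and stops once a fixed number of epochs pass without improvement. The stopping criterion, the `patience` value, and the function name are assumptions; the optimizer step itself is left out, and the loss values are made up for the example.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, best_loss), stopping after `patience` epochs without improvement."""
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best validation loss: remember it (a real loop would save a checkpoint).
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs: stop training.
            break
    return best_epoch, best_loss


# Illustrative validation losses only, not results from this project.
best = early_stopping_epoch([0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.5], patience=3)
# best == (3, 0.6): training stops at epoch 6, so the 0.5 at epoch 7 is never reached.
```

In a real training loop, `val_losses` would be a generator that runs one training epoch and yields the validation loss on each iteration.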

Model: Gradient Boosted Trees

The model is an ensemble of classification trees trained using a gradient boosting approach.

Pre-processing

Individual words were extracted from tweets using the Twitter tokenizer from NLTK. All numbers and hyperlinks were then replaced with special tokens, and the remaining words were lemmatized using the WordNet lemmatizer. The resulting phrases were converted to vectors using a bag-of-words representation with tf-idf weighting.
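
To make the bag-of-words with tf-idf step concrete, here is a small standard-library sketch. It uses one common variant, term frequency times log(N / document frequency); library implementations often use smoothed or normalized variants, and which one the repository uses is not stated here. The function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Map tokenized documents to dense tf-idf vectors over a sorted vocabulary."""
    vocab = sorted({token for doc in documents for token in doc})
    n_docs = len(documents)
    # Document frequency: number of documents that contain each token.
    df = Counter(token for doc in documents for token in set(doc))
    idf = {token: math.log(n_docs / df[token]) for token in vocab}
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        length = len(doc) or 1  # avoid division by zero for empty documents
        vectors.append([counts[token] / length * idf[token] for token in vocab])
    return vocab, vectors


docs = [["good", "day"], ["bad", "day"]]
vocab, vectors = tfidf_vectors(docs)
# vocab == ["bad", "day", "good"]; "day" appears in every document, so its weight is 0.
```

Each row of `vectors` has one entry per vocabulary word, which is the fixed-length input format a tree ensemble expects.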

Training

Training was performed using the xgboost library, with the parameters found in model_xgboost.py.

Dependencies

  • Python - 3.5 or above
  • PyTorch - Machine Learning Toolkit
  • torchtext - Data loaders and abstractions for text and NLP
  • NLTK - The Natural Language Toolkit
  • xgboost - XGBoost library

License

GNU General Public License v3.0 (GPL-3.0)