Machine learning to predict the sentiment of tweets
Fill out `example_settings.py` with your own credentials and options, then rename the file to `settings.py`.
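The exact contents of `example_settings.py` aren't shown here; as a purely hypothetical sketch, it might hold fields along these lines (Twitter API credentials and a data path are assumptions, not the repo's actual settings):

```python
# Hypothetical example_settings.py -- every field here is an assumption.
CONSUMER_KEY = 'your-consumer-key'
CONSUMER_SECRET = 'your-consumer-secret'
ACCESS_TOKEN = 'your-access-token'
ACCESS_TOKEN_SECRET = 'your-access-token-secret'

DATA_DIR = 'data/'
```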
Make sure `python3` is installed.
Extract `data.zip` into the `data/` directory and ensure any paths referenced in `main.py` are accurate.
Create a virtualenv by running
$ virtualenv venv -p python3
Activate the virtualenv by running
$ source ./venv/bin/activate
Install requirements with pip
$ pip install -r requirements.txt
Run the sklearn model
$ python main.py
Run the Spark model
$ python spark.py
Two classifiers run here: sklearn and Apache Spark implementations of the Multinomial Naive Bayes classifier. MultinomialNB is the classic choice for text analysis in machine learning.
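As a quick illustration (toy data, not this repo's), MultinomialNB learns per-class word-count statistics and predicts the most probable class for new text:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["i love this", "great day", "i hate this", "awful day"]
labels = [1, 1, 0, 0]  # 1 = positive sentiment, 0 = negative

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(texts)     # bag-of-words count vectors
model = MultinomialNB().fit(counts, labels)

print(model.predict(vectorizer.transform(["what a great day"])))
```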
The best way to see what's happening behind the scenes is to look at the sklearn implementation in `classifier.py`, following the `main.py` entry point.
In `main.py`, `readdata()` is called twice, producing two raw datasets that have undergone no preprocessing. In effect, each is the tweet text plus a 'sentiment rating' (1 for positive sentiment, 0 for negative) for every tweet. Next, the classifier is trained with the call to `classifier.train()`.
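In rough outline, the flow of `main.py` looks something like the sketch below (the file paths and exact return values are assumptions based on this description, not the repo's actual code):

```python
import classifier

# readdata() (defined in main.py) yields raw, unpreprocessed tweet text
# plus a 1/0 sentiment rating per tweet; it is called once per dataset.
train_data, train_labels = readdata('data/train.csv')  # hypothetical path
test_data, test_labels = readdata('data/test.csv')     # hypothetical path

# Train, then measure accuracy on the held-out set.
model, vectorizer, transformer = classifier.train(train_data, train_labels)
print(classifier.test(model, vectorizer, transformer, test_data, test_labels))
```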
In `classifier.py`, the `train()` function takes `data` (tweet text) and `labels` (sentiment ratings) as arguments and creates three new objects: a `MultinomialNB` classifier, a `CountVectorizer` that produces a word-count vector from each tweet, and a `TfidfTransformer` that converts those counts into a frequency vector. The array representation of this output is also known as LibSVM format, which will be important later!
Next, a call to `fit_transform()` fits the vectorizer and transformer on each tweet; the model uses these fitted objects to make predictions on supplied data. The model itself is fitted with the frequency vectors and the sentiment rating of each tweet.
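Put together, a minimal sketch of what `train()` plausibly looks like (variable names and the return value are assumptions):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

def train(data, labels):
    vectorizer = CountVectorizer()    # word counts per tweet
    transformer = TfidfTransformer()  # counts -> frequency (tf-idf) vectors
    model = MultinomialNB()

    # fit_transform() learns the vocabulary / idf weights while transforming.
    counts = vectorizer.fit_transform(data)
    frequencies = transformer.fit_transform(counts)

    # Fit the model with the frequency vectors and sentiment ratings.
    model.fit(frequencies, labels)
    return model, vectorizer, transformer
```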
The `test()` function takes a similarly structured dataset, asks the model to predict sentiment ratings based on its training, and returns the accuracy of the test results. Trained models can be saved and reloaded using `save()` and `load()` and the options in `main.py`.
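Sketched under the same assumptions, `test()` reuses the fitted vectorizer and transformer, and `save()`/`load()` could be as simple as joblib persistence (the repo may well use a different mechanism):

```python
import joblib

def test(model, vectorizer, transformer, data, labels):
    counts = vectorizer.transform(data)          # reuse the fitted vocabulary
    frequencies = transformer.transform(counts)  # reuse the fitted idf weights
    predictions = model.predict(frequencies)
    return (predictions == labels).mean()        # fraction predicted correctly

def save(model, path):
    joblib.dump(model, path)

def load(path):
    return joblib.load(path)
```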
The same process is mimicked in `spark.py` using the Apache Spark library. The main difference is the preprocessing: each tweet needs to be converted to LibSVM format, then to a `LabeledPoint` object, which must then be packaged in an RDD (Resilient Distributed Dataset). We use `classifier.py` to compute the LibSVM array for each tweet and perform the conversions. The data is then fed into the model in `train()` and tested in `test()` in the same way as in the sklearn counterpart.
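A rough sketch of that pipeline with pyspark's MLlib API is below; the toy `labels` and `vectors` stand in for the LibSVM-style arrays that `classifier.py` would compute:

```python
from pyspark import SparkContext
from pyspark.mllib.classification import NaiveBayes
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext('local', 'tweet-sentiment')

# Stand-ins for the per-tweet frequency arrays and sentiment ratings.
labels = [1.0, 0.0]
vectors = [[0.0, 0.8, 0.2], [0.7, 0.0, 0.3]]

# Wrap each tweet in a LabeledPoint, then package the points in an RDD.
points = [LabeledPoint(label, vector) for label, vector in zip(labels, vectors)]
rdd = sc.parallelize(points)

# MLlib's NaiveBayes is multinomial by default, mirroring the sklearn model.
model = NaiveBayes.train(rdd)
print(model.predict(vectors[0]))
```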