What about classifying sentiments on Twitter sample data?
As the name says, it's experimental. Please don't be a fool: this is not to be used in a production environment.
The Twitter API Terms do not allow sharing or resyndicating Twitter content, so I will not do it.
However, it's possible to write a script that builds a corpus, and I did that. The corpus generator uses the Twitter stream and is composed of two parts (before running them, configure your environment as described further below):
- Use the Twitter Streaming API to download tweets:
foreman run forest_consume
That will consume the Twitter sample stream and save it to a MongoDB database. Trainable tweets will be flagged. It will never finish on its own: you need to decide how big you want your corpus to be, and when you decide it's enough, simply stop it.
To detect trainable tweets I simply look at emoticons: if a tweet has a happy or a sad emoticon, it's a trainable tweet. This idea is not mine; I found it in 'Twitter as a Corpus for Sentiment Analysis and Opinion Mining' (A. Pak, P. Paroubek - LREC, 2010).
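Roughly, the consumer works like the sketch below. It assumes the twitter and mongoid gems; the Tweet model, field names, environment variable names and emoticon sets are illustrative, not necessarily what the script actually uses.

```ruby
require 'dotenv'
require 'twitter'
require 'mongoid'

Dotenv.load
Mongoid.load!('config/mongoid.yml', :development) # path assumed

# Illustrative model: one document per captured tweet.
class Tweet
  include Mongoid::Document
  field :text,      type: String
  field :trainable, type: Boolean, default: false
end

# A tweet is trainable when it carries a happy or a sad emoticon.
HAPPY = /:-?\)|:D|=\)/
SAD   = /:-?\(|;\(/

def trainable?(text)
  !!(text =~ HAPPY || text =~ SAD)
end

client = Twitter::Streaming::Client.new do |config|
  config.consumer_key        = ENV['TWITTER_CONSUMER_KEY']
  config.consumer_secret     = ENV['TWITTER_CONSUMER_SECRET']
  config.access_token        = ENV['TWITTER_ACCESS_TOKEN']
  config.access_token_secret = ENV['TWITTER_ACCESS_TOKEN_SECRET']
end

# Consume the sample stream until you decide the corpus is big enough.
client.sample do |object|
  next unless object.is_a?(Twitter::Tweet)
  Tweet.create!(text: object.text, trainable: trainable?(object.text))
end
```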
- After that you need to train the classifier:
foreman run forest_train
That will generate a bayes_data folder containing your trained classifier.
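Under the hood the training step does something like the sketch below. It assumes a Bayes implementation along the lines of the nbayes gem and the illustrative Tweet model from the previous sketch; method names and the dump file name are assumptions, not the script's actual API.

```ruby
require 'nbayes'

HAPPY = /:-?\)|:D|=\)/   # same illustrative emoticon sets as above
SAD   = /:-?\(|;\(/

nbayes = NBayes::Base.new

# Every flagged tweet is labelled by the emoticon it contains and
# fed to the classifier as a bag of lower-cased tokens.
Tweet.where(trainable: true).each do |tweet|
  label  = tweet.text =~ HAPPY ? 'positive' : 'negative'
  tokens = tweet.text.downcase.split(/\s+/)
  nbayes.train(tokens, label)
end

# Persist the trained model so it can be reloaded later.
Dir.mkdir('bayes_data') unless Dir.exist?('bayes_data')
nbayes.dump('bayes_data/classifier.yml')
```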
The Twitter experiment needs to authenticate against the Twitter developer API, so you need to export some credential variables. To handle that we use dotenv, so all you need to do is:
- Copy config/env.sample to .env:
cp config/env.sample .env
- Edit .env with your own keys (an illustrative sample follows).
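Your .env should end up looking something like this; the variable names below are placeholders, so use whatever names actually appear in config/env.sample:

```
# Illustrative only: the real variable names are in config/env.sample
TWITTER_CONSUMER_KEY=your_consumer_key
TWITTER_CONSUMER_SECRET=your_consumer_secret
TWITTER_ACCESS_TOKEN=your_access_token
TWITTER_ACCESS_TOKEN_SECRET=your_access_token_secret
```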
The script saves Twitter data in MongoDB, so you need to configure it:
- Copy config/mongoid.sample to mongoid.yml:
cp config/mongoid.sample mongoid.yml
- Edit your config/mongoid.yml with your MongoDB settings (an illustrative sample follows).
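For reference, a minimal mongoid.yml might look like the snippet below; the database name is an assumption, and the top-level key depends on your Mongoid version (clients: on Mongoid 5+, sessions: on older releases):

```yaml
# Illustrative only: adjust to your Mongoid version and MongoDB setup
development:
  clients:            # use "sessions:" on Mongoid 3/4
    default:
      database: forest_development
      hosts:
        - localhost:27017
```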
To validate the experiment, I computed some statistics. For that:
- I found a set of 4662 tweets.
- Split them into 90% + 10%.
- Trained the 90% on the Naive Bayes classifier.
- Classified the other 10% using the trained classifier.
After that I got these results:
Metric | Score |
---|---|
F1-Score | 0.387479175558645 |
Accuracy | 0.775160599571734 |
Recall | 0.774870646948735 |
Precision | 0.775224132863021 |
Matthews correlation | 0.550321199143469 |
To re-run the statistics you can do:
foreman run forest_statistics
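For reference, the reported numbers follow the standard definitions below, computed from the confusion matrix of the held-out 10%; this is just a sketch of the math, not necessarily how the statistics script implements it.

```ruby
# tp/fp/tn/fn: confusion-matrix counts from classifying the held-out 10%
def statistics(tp:, fp:, tn:, fn:)
  precision = tp.to_f / (tp + fp)
  recall    = tp.to_f / (tp + fn)
  {
    precision: precision,
    recall:    recall,
    f1:        2 * precision * recall / (precision + recall),
    accuracy:  (tp + tn).to_f / (tp + fp + tn + fn),
    # Matthews correlation coefficient
    mcc:       (tp * tn - fp * fn) /
               Math.sqrt((tp + fp).to_f * (tp + fn) * (tn + fp) * (tn + fn))
  }
end

statistics(tp: 40, fp: 10, tn: 40, fn: 10)  # arbitrary counts, for illustration only
```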
- Handle negations by attaching the negation particle to its neighbouring words (see the sketch below).
E.g.: "I do not like fish" becomes: I do+not, do+not like, not+like fish
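One way to read that example is the sketch below: slide a three-token window over the text and glue the negation particle to the word before it, or to the word after it when the particle opens the window. The particle list and helper names are illustrative.

```ruby
NEGATIONS = %w[not no never].freeze  # illustrative particle list

# Glue a negation particle to its preceding word, or to the following
# word when the particle is the first token of the window.
def glue_negation(window)
  result = []
  skip_next = false
  window.each_with_index do |word, i|
    if skip_next
      skip_next = false
      next
    end
    if NEGATIONS.include?(word.downcase) && !result.empty?
      result[-1] = "#{result.last}+#{word}"
    elsif NEGATIONS.include?(word.downcase) && window[i + 1]
      result << "#{word}+#{window[i + 1]}"
      skip_next = true
    else
      result << word
    end
  end
  result.join(' ')
end

def negation_features(text)
  text.split(/\s+/).each_cons(3).map { |window| glue_negation(window) }
end

negation_features('I do not like fish')
# => ["I do+not", "do+not like", "not+like fish"]
```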