Twitter Experiment

What about classifying sentiments on Twitter sample data?

ATTENTION

As the name says, it's experimental. Please don't be a fool: this is not meant to be used in a production environment.

Corpus Generation

The Twitter API terms do not allow sharing or resyndicating Twitter content, so I will not do it.

However, it's possible to write a script that creates a corpus, and I did that. The corpus generator uses the Twitter stream. The process has two parts, but before running them you need to configure your environment (see Configuration):

  • Use the Twitter Streaming API to download tweets.
foreman run forest_consume

That will consume the Twitter Sample Stream and save it to a MongoDB database. Trainable tweets will be flagged. It never finishes on its own: you need to decide how big you want your corpus to be, and when you decide it is big enough, simply stop it.

To detect trainable tweets I simply look for emoticons. If a tweet has a happy or a sad emoticon, it's a trainable tweet. This idea was not mine; I found it in 'Twitter as a Corpus for Sentiment Analysis and Opinion Mining' (A. Pak, P. Paroubek, LREC 2010).
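The emoticon check can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the emoticon lists, the method names `emoticon_label` and `trainable?`, and the choice to skip tweets that contain both happy and sad emoticons are all assumptions.

```ruby
# Hypothetical emoticon lists; the lists used by the project may differ.
HAPPY_EMOTICONS = [':)', ':-)', ':D', '=)', ';)'].freeze
SAD_EMOTICONS   = [':(', ':-(', '=(', ":'("].freeze

# Returns :positive, :negative, or nil when the tweet carries no usable label.
def emoticon_label(text)
  happy = HAPPY_EMOTICONS.any? { |e| text.include?(e) }
  sad   = SAD_EMOTICONS.any? { |e| text.include?(e) }
  # Assumption: a tweet with no emoticon, or with both kinds, is skipped.
  return nil if happy == sad

  happy ? :positive : :negative
end

# A tweet is trainable when the emoticons give it a single clear label.
def trainable?(text)
  !emoticon_label(text).nil?
end
```

The label doubles as the training class, so no manual annotation is needed.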

  • After that you need to train the classifier.
foreman run forest_train

That will generate a folder bayes_data with your training data.
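To give an idea of what training a Naive Bayes classifier involves, here is a small self-contained sketch with add-one smoothing. It is not the classifier this project uses (which persists its state to bayes_data); the class name `TinyNaiveBayes` and its tokenizer are made up for illustration.

```ruby
# Minimal multinomial Naive Bayes with add-one (Laplace) smoothing.
class TinyNaiveBayes
  def initialize
    @word_counts = Hash.new { |h, k| h[k] = Hash.new(0) } # label => word => count
    @doc_counts  = Hash.new(0)                            # label => number of documents
    @vocabulary  = {}                                     # set of all seen words
  end

  # Record one labeled document.
  def train(label, text)
    @doc_counts[label] += 1
    tokenize(text).each do |word|
      @word_counts[label][word] += 1
      @vocabulary[word] = true
    end
  end

  # Return the label with the highest log posterior for the given text.
  def classify(text)
    total_docs = @doc_counts.values.sum.to_f
    words = tokenize(text)
    @doc_counts.keys.max_by do |label|
      total_words = @word_counts[label].values.sum
      log_prior = Math.log(@doc_counts[label] / total_docs)
      log_likelihood = words.sum do |word|
        Math.log((@word_counts[label][word] + 1.0) / (total_words + @vocabulary.size))
      end
      log_prior + log_likelihood
    end
  end

  private

  # Naive tokenizer (an assumption): lowercase runs of letters and apostrophes.
  def tokenize(text)
    text.downcase.scan(/[a-z']+/)
  end
end

nb = TinyNaiveBayes.new
nb.train(:positive, 'i love this great day')
nb.train(:positive, 'what a happy great morning')
nb.train(:negative, 'i hate this awful day')
nb.train(:negative, 'what a sad awful morning')
nb.classify('great happy') # => :positive
```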

Configuration

Twitter Experiment needs to authenticate with Twitter as a developer application, so you need to export some variables. We use dotenv to handle that, so all you need to do is:

  • Copy env.sample to .env.
cp config/env.sample .env
  • Edit .env with your own keys

The script saves Twitter data to MongoDB, so you need to configure it.

  • Copy mongoid.sample to mongoid.yml
cp config/mongoid.sample config/mongoid.yml
  • Edit your config/mongoid.yml with your MongoDB settings.

Results

To validate the experiment, I computed some statistics. For that:

  • I gathered a set of 4662 tweets.
  • Split them into 90% + 10%.
  • Trained a Naive Bayes classifier on the 90%.
  • Classified the other 10% using the trained classifier.

After that I got these results:
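The 90% + 10% split can be sketched as below. The helper name `train_test_split`, the shuffle, and the fixed seed are assumptions for illustration; the project's statistics script may split the data differently.

```ruby
# Split items into a training part and a held-out test part.
# A seeded shuffle keeps the split reproducible between runs.
def train_test_split(items, train_ratio: 0.9, seed: 42)
  shuffled = items.shuffle(random: Random.new(seed))
  cut = (shuffled.size * train_ratio).floor
  [shuffled.first(cut), shuffled.drop(cut)]
end

train, test = train_test_split((1..4662).to_a)
train.size # => 4195
test.size  # => 467
```

Keeping the test part out of training is what makes the resulting scores an estimate of performance on unseen tweets.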

Metric                 Value
F1-Score               0.387479175558645
Accuracy               0.775160599571734
Recall                 0.774870646948735
Precision              0.775224132863021
Matthews correlation   0.550321199143469
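For reference, the textbook definitions of these metrics for a binary classifier can be written as the sketch below. The method name `binary_metrics` and its keyword arguments are hypothetical, and the project's statistics script may average per class or use other formulas. Note that under the usual definition F1 = 2PR / (P + R), the precision and recall listed above would give an F1 close to 0.775; the listed 0.387 is about half of that, so the original script may compute it differently.

```ruby
# Standard binary-classification metrics from confusion-matrix counts.
# tp/fp/fn/tn = true positives, false positives, false negatives, true negatives.
# Degenerate inputs (for example, no predicted positives) yield NaN or Infinity.
def binary_metrics(tp:, fp:, fn:, tn:)
  precision = tp.to_f / (tp + fp)
  recall    = tp.to_f / (tp + fn)
  accuracy  = (tp + tn).to_f / (tp + fp + fn + tn)
  f1        = 2 * precision * recall / (precision + recall)
  # Matthews correlation coefficient.
  mcc = (tp * tn - fp * fn) /
        Math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

  { precision: precision, recall: recall, accuracy: accuracy, f1: f1, mcc: mcc }
end

metrics = binary_metrics(tp: 40, fp: 10, fn: 20, tn: 30)
metrics[:precision] # => 0.8
metrics[:accuracy]  # => 0.7
```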

To re-run the statistics, you can do:

foreman run forest_statistics

TODO

  • Handle negations by attaching the negation particle to the word before or after it.

E.g.: I do not like fish: I do+not, do+not like, not+like fish
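One possible reading of this TODO, sketched under assumptions: merge each negation particle with the word before it and, separately, with the word after it, then keep the bigrams that contain a merged token. The particle list and the method names are hypothetical. For the example sentence this variant yields the three bigrams above plus "do not+like", so it is a superset of the example rather than an exact match.

```ruby
# Hypothetical list of negation particles; extend as needed.
NEGATIONS = %w[not no never].freeze

# Merge each negation particle into the word before it: do not -> do+not
def attach_to_previous(tokens)
  tokens.each_with_object([]) do |tok, out|
    if NEGATIONS.include?(tok.downcase) && !out.empty?
      out[-1] = "#{out[-1]}+#{tok}"
    else
      out << tok
    end
  end
end

# Merge each negation particle into the word after it: not like -> not+like
def attach_to_next(tokens)
  out = []
  pending = nil
  tokens.each do |tok|
    if NEGATIONS.include?(tok.downcase)
      pending = pending ? "#{pending}+#{tok}" : tok
    else
      out << (pending ? "#{pending}+#{tok}" : tok)
      pending = nil
    end
  end
  out << pending if pending # trailing particle with nothing to attach to
  out
end

# Bigrams from both merged sequences that contain an attached negation.
def negation_bigrams(text)
  tokens = text.split
  [attach_to_previous(tokens), attach_to_next(tokens)]
    .flat_map { |seq| seq.each_cons(2).map { |pair| pair.join(' ') } }
    .select { |bigram| bigram.include?('+') }
    .uniq
end

negation_bigrams('I do not like fish')
# => ["I do+not", "do+not like", "do not+like", "not+like fish"]
```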