Twitter Mining Project

This project is an ML/NLP library in Java for analyzing tweets and building predictive models. The models are intended to help election, advertising, and marketing campaigns mine social media conversations (public opinion) for insights that support better decisions.

The project consists of four main packages and a resource directory:

  1. Algorithms package contains implementations of a few ML/NLP algorithms for running text analysis on tweet contents.
  2. Twitter package wraps Twitter data regardless of the persistence layer used to store and retrieve tweets.
  3. Runanalysis package is the interface for running the ML/NLP algorithms.
  4. Utilities package provides a collection of helper classes for the different analyses.
  5. Resources directory includes a few data sources used for tweet analysis, such as stop words and training data for sentiment analysis.

Package Details:

Algorithms Package:

  1. LDA Algorithm: an implementation of the Latent Dirichlet Allocation algorithm, used for topic modeling.
  2. NaiveBayes Classifier: a customized version of the Naive Bayes classifier for running sentiment analysis on tweets.
  3. TextAnalysis: a class for performing various text analyses, such as computing word frequencies.
  4. TweetsStatistics: provides functionality for computing basic statistics from tweets.
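
As an illustration of the kind of word-frequency computation mentioned for TextAnalysis, here is a minimal sketch. The class name WordFrequencySketch, the method signature, and the tokenization rule are assumptions made for this example; they are not taken from the library's code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; not the library's TextAnalysis class.
final class WordFrequencySketch {

    // Counts how often each token occurs across all tweets, skipping stop words.
    static Map<String, Integer> wordFrequencies(List<String> tweets, Set<String> stopWords) {
        // LinkedHashMap keeps tokens in first-seen order, which makes output predictable.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String tweet : tweets) {
            // Assumed tokenization: lowercase, then split on any run of characters
            // other than ASCII letters, digits, '#' and '@'.
            for (String token : tweet.toLowerCase(Locale.ROOT).split("[^a-z0-9#@]+")) {
                if (token.isEmpty() || stopWords.contains(token)) {
                    continue;
                }
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> tweets = Arrays.asList("Vote today!", "Did you vote today or not?");
        Set<String> stopWords = new HashSet<>(Arrays.asList("did", "you", "or", "not"));
        System.out.println(wordFrequencies(tweets, stopWords)); // prints {vote=2, today=2}
    }
}
```

Keeping '#' and '@' inside tokens is a choice made here so that hashtags and mentions are counted as their own words; a real pipeline might normalize them differently.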

Twitter Package:

  1. Tweet: a class representing a tweet.
  2. TweetDate: a class for dealing with date ranges. This allows us to analyze tweets in a given time range.
  3. TweetsConstants: a class for constants and configuration parameters.
  4. TwitterDataSource: an interface designed to abstract over different persistence layers.
  5. TwitterFileDataSource: an implementation of the TwitterDataSource interface for when the persistence layer is a raw file.
  6. TwitterMySqlDataSource: an implementation of the TwitterDataSource interface for when the persistence layer is a MySQL database.
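
The idea of hiding the persistence layer behind an interface can be sketched as follows. This is only an illustration: the names TweetSketch, TweetSourceSketch, InMemoryTweetSource, and tweetsBetween are hypothetical, the real TwitterDataSource methods may look different, and treating the timestamp as Unix seconds is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-in for the library's Tweet class.
final class TweetSketch {
    final long timestamp; // assumed to be Unix time in seconds
    final String text;

    TweetSketch(long timestamp, String text) {
        this.timestamp = timestamp;
        this.text = text;
    }
}

// Hypothetical interface: analysis code depends on this, not on files or MySQL.
interface TweetSourceSketch {
    // Returns tweets with fromInclusive <= timestamp < toExclusive.
    List<TweetSketch> tweetsBetween(long fromInclusive, long toExclusive);
}

// One possible backend; a file-based or MySQL-based class would implement the same interface.
final class InMemoryTweetSource implements TweetSourceSketch {
    private final List<TweetSketch> tweets;

    InMemoryTweetSource(List<TweetSketch> tweets) {
        this.tweets = new ArrayList<>(tweets);
    }

    @Override
    public List<TweetSketch> tweetsBetween(long fromInclusive, long toExclusive) {
        List<TweetSketch> result = new ArrayList<>();
        for (TweetSketch tweet : tweets) {
            if (tweet.timestamp >= fromInclusive && tweet.timestamp < toExclusive) {
                result.add(tweet);
            }
        }
        return result;
    }
}

final class DataSourceDemo {
    public static void main(String[] args) {
        TweetSourceSketch source = new InMemoryTweetSource(Arrays.asList(
                new TweetSketch(100L, "first"),
                new TweetSketch(200L, "second"),
                new TweetSketch(300L, "third")));
        // The caller never learns which backend produced the tweets.
        System.out.println(source.tweetsBetween(100L, 300L).size()); // prints 2
    }
}
```

Because callers only see the interface, swapping the storage backend should not require changes to the analysis code.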

Runanalysis Package:

  1. RunBayes: runs sentiment analysis on tweets using NaiveBayes class.
  2. RunLDA: runs topic modeling on tweets using LDA class.
  3. RunStatistics: runs basic statistics on tweets using TweetsStatistics class.
  4. RunTextAnalysis: runs text analysis on tweets using TextAnalysis class.
  5. ThreadPool & WorkerThread: multi-threaded code for running analyses.
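
To show how an analysis can be spread over worker threads, here is a minimal sketch. It uses the standard java.util.concurrent ExecutorService rather than the project's own ThreadPool and WorkerThread classes, whose API is not described here; the class ParallelAnalysisSketch and the word-count task are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration; not the library's ThreadPool/WorkerThread classes.
final class ParallelAnalysisSketch {

    // Submits one word-counting task per batch of tweets and returns the
    // per-batch totals in the same order as the input batches.
    static List<Integer> countWordsPerBatch(List<List<String>> batches, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (List<String> batch : batches) {
                Callable<Integer> task = () -> {
                    int words = 0;
                    for (String tweet : batch) {
                        String trimmed = tweet.trim();
                        if (!trimmed.isEmpty()) {
                            words += trimmed.split("\\s+").length;
                        }
                    }
                    return words;
                };
                futures.add(pool.submit(task));
            }
            List<Integer> results = new ArrayList<>();
            for (Future<Integer> future : futures) {
                results.add(future.get()); // blocks until that batch is done
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for results", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A worker task failed", e);
        } finally {
            pool.shutdown(); // always release the worker threads
        }
    }

    public static void main(String[] args) {
        List<List<String>> batches = Arrays.asList(
                Arrays.asList("a b", "c"),
                Arrays.asList("d e f"));
        System.out.println(countWordsPerBatch(batches, 2)); // prints [3, 3]
    }
}
```

In practice a batch could correspond to one day interval of tweets, so that each interval is analyzed independently.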

Utilities Package:

  1. DayIntervals: a class for reading day interval files and generating a list of day pairs.
  2. GenerateCsv: a class for generating a CSV file for post-processing and visualization steps.
  3. MapUtil: a class for printing TreeMap data.
  4. Pair: a class for defining pair objects.
  5. SentimentLabel: sentiment labels.
  6. StopWords: a class for building stop words for NLP analysis.
  7. TimeZone: time zone class.
  8. TweetUtils: a helper class with functionality for cleaning/normalizing tweets.
  9. ValueComparator: a comparator class.
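
A common need in this kind of analysis is ordering word counts by value, which is presumably what a value comparator is for; that reading is an assumption, as the class's behavior is not documented here. The sketch below uses a hypothetical SortByValueSketch class and does not reproduce the library's ValueComparator or MapUtil.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not the library's ValueComparator or MapUtil.
final class SortByValueSketch {

    // Returns the keys ordered by count, highest first; ties are broken alphabetically.
    static List<String> keysByCountDescending(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : entries) {
            keys.add(entry.getKey());
        }
        return keys;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("romney", 2);
        counts.put("vote", 3);
        counts.put("obama", 3);
        System.out.println(keysByCountDescending(counts)); // prints [obama, vote, romney]
    }
}
```

Sorting a list of entries avoids a known pitfall of using a value-based comparator inside a TreeMap, where keys with equal values can be treated as duplicates.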

Tweets Data Schema:

This library requires your Twitter data to be stored in a MySQL database/table (i.e. politics/tweets). The schema of the tweets table is shown below:

  Field        Type                Key
  id           int(10) unsigned    PRI
  timestamp    int(10) unsigned
  source       varchar(40)
  author       varchar(20)
  lat          decimal(10,8)
  lng          decimal(11,8)
  text         varchar(140)
  created_at   datetime
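
One way to mirror this schema on the Java side is sketched below. The class TweetRowSketch and its field types are assumptions chosen for this example (for instance, long for the unsigned integer columns, because int(10) unsigned can exceed a signed Java int); the library's own Tweet class may be laid out differently.

```java
import java.math.BigDecimal;
import java.time.LocalDateTime;

// Hypothetical value class mirroring the tweets table; not the library's Tweet class.
final class TweetRowSketch {
    final long id;                 // int(10) unsigned, primary key
    final long timestamp;          // int(10) unsigned
    final String source;           // varchar(40)
    final String author;           // varchar(20)
    final BigDecimal lat;          // decimal(10,8)
    final BigDecimal lng;          // decimal(11,8)
    final String text;             // varchar(140)
    final LocalDateTime createdAt; // the datetime column

    TweetRowSketch(long id, long timestamp, String source, String author,
                   BigDecimal lat, BigDecimal lng, String text, LocalDateTime createdAt) {
        if (id < 0 || timestamp < 0) {
            throw new IllegalArgumentException("unsigned columns cannot be negative");
        }
        this.id = id;
        this.timestamp = timestamp;
        this.source = requireMaxLength("source", source, 40);
        this.author = requireMaxLength("author", author, 20);
        this.lat = lat;
        this.lng = lng;
        this.text = requireMaxLength("text", text, 140);
        this.createdAt = createdAt;
    }

    // Rejects values longer than the column width. String.length() counts UTF-16
    // code units, so this is only an approximation of MySQL's character count.
    private static String requireMaxLength(String field, String value, int max) {
        if (value != null && value.length() > max) {
            throw new IllegalArgumentException(field + " exceeds " + max + " characters");
        }
        return value;
    }
}
```

Validating lengths in the constructor surfaces oversized values before an insert would truncate or reject them, depending on the server's SQL mode.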

If you'd like to read more about this project, see the web page "Barack Obama or Mitt Romney: that's the question!". You can also read our published paper that uses this ML/NLP framework: "The Predictive Power of Social Media: On the Predictability of U.S. Presidential Elections using Twitter".

If you have any questions about the code, contact me at kDOTjahanbakhshATgmailDOTcom

Licence

Copyright (c) 2013 Black Square Media Ltd. All rights reserved.
(The MIT License)

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
'Software'), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.