
Twitter hashtag tracking, analysis and classification.


Python 3.4 | MIT license | release v1.4

Twitter Hashtag Tracking

Motivation

Track specific hashtags or keywords on Twitter and analyze the matching tweets in real time.

Run Example

Configuration

Create your own src/config.json file with your Twitter API credentials (consumer key and secret, access token and secret).

{ "asecret": "XXX...XXX",
  "atoken":  "XXX...XXX",
  "csecret": "XXX...XXX",
  "ckey":    "XXX...XXX"
}

Modify the conf/parameters.json file to set the parameters.

{ "hashtag": "#overwatch",
  "DStream": { "batch_interval": "60",
               "window_time": "60",
               "process_times": "60" }
}

Suggestion: set batch_interval and window_time to multiples of 60.
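
The values in the parameters file are stored as strings, so a job that reads them presumably converts them to integers first. The following is a minimal sketch of that step, not the project's actual code: the function names parse_dstream_params and non_multiples_of_60 are hypothetical, and the check simply encodes the suggestion above.

```python
import json

# Example contents, copied from the parameters file shown above.
EXAMPLE = '''
{ "hashtag": "#overwatch",
  "DStream": { "batch_interval": "60",
               "window_time": "60",
               "process_times": "60" }
}
'''


def parse_dstream_params(text):
    """Parse the parameters JSON and return (hashtag, settings as integers)."""
    params = json.loads(text)
    dstream = {key: int(value) for key, value in params["DStream"].items()}
    return params["hashtag"], dstream


def non_multiples_of_60(dstream, keys=("batch_interval", "window_time")):
    """Return the names of settings that are not positive multiples of 60."""
    return [k for k in keys if dstream[k] <= 0 or dstream[k] % 60 != 0]


if __name__ == "__main__":
    hashtag, dstream = parse_dstream_params(EXAMPLE)
    print(hashtag, dstream, non_multiples_of_60(dstream))
```

An empty list from non_multiples_of_60 means both settings follow the suggestion.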

MongoDB Database

Start a mongod process

sudo mongod

Model Training

Run Spark jobs to train a Naive Bayes model for later sentiment analysis.

$SPARK_HOME/bin/spark-submit src/model.py > log/model.log

You can check the accuracy of the trained model in log/model.log:

>>> Accuracy
0.959944108057755
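
For readers unfamiliar with the classifier, the sketch below shows the idea behind a multinomial Naive Bayes model and an accuracy score in plain Python. It is a stand-in for illustration only, not the Spark MLlib code in src/model.py: the function names, the add-one smoothing default, and the toy training data are all assumptions, and how the project splits its data for the reported accuracy is not stated here.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples, smoothing=1.0):
    """Train on an iterable of (tokens, label) pairs and return a model dict."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {"labels": label_counts, "words": word_counts,
            "vocab": vocab, "smoothing": smoothing}


def predict(model, tokens):
    """Return the label with the highest log-probability for the tokens."""
    total_docs = sum(model["labels"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, float("-inf")
    for label, n_docs in model["labels"].items():
        counts = model["words"][label]
        denom = sum(counts.values()) + model["smoothing"] * vocab_size
        score = math.log(n_docs / total_docs)  # class prior
        for tok in tokens:
            if tok in model["vocab"]:  # ignore words never seen in training
                score += math.log((counts[tok] + model["smoothing"]) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, examples):
    """Fraction of (tokens, label) pairs that the model labels correctly."""
    examples = list(examples)
    correct = sum(1 for tokens, label in examples
                  if predict(model, tokens) == label)
    return correct / len(examples)


if __name__ == "__main__":
    toy = [(["love", "great"], "positive"), (["great", "fun"], "positive"),
           (["hate", "awful"], "negative"), (["awful", "boring"], "negative")]
    model = train_naive_bayes(toy)
    print(predict(model, ["great", "fun"]), accuracy(model, toy))
```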

Twitter Input

Start the streaming script. It waits for a connection before it starts streaming tweets.

python3.4 src/stream.py

Spark Streaming

Run Spark jobs to do real-time analysis on the tweets.

$SPARK_HOME/bin/spark-submit src/analysis.py > log/analysis.log

Dashboard

Run the data visualization jobs.

python3.4 web/dashboard.py

Process

Twitter API

  • Use tweepy, a Python client for the Twitter API, to stream tweets
  • Keep only the tweets that contain the keywords/hashtag being tracked
  • Use a TCP/IP socket to send the fetched tweets to the Spark job
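
The filter-and-forward steps above can be sketched without tweepy. This is an illustration under stated assumptions, not the code in src/stream.py: the function names are hypothetical, tweets are plain strings, and each tweet is sent as one newline-terminated line, which is the format Spark's socket text stream reads. socket.socketpair() stands in for the real connection to the Spark job.

```python
import socket


def matches(tweet_text, keywords):
    """True if the tweet contains any tracked keyword (case-insensitive)."""
    lowered = tweet_text.lower()
    return any(k.lower() in lowered for k in keywords)


def send_tweet(conn, tweet_text):
    """Send one tweet as a newline-terminated UTF-8 line over a socket."""
    # Embedded newlines would split one tweet into several records.
    line = tweet_text.replace("\n", " ") + "\n"
    conn.sendall(line.encode("utf-8"))


def forward(tweets, keywords, conn):
    """Send every matching tweet to the socket; return how many were sent."""
    sent = 0
    for text in tweets:
        if matches(text, keywords):
            send_tweet(conn, text)
            sent += 1
    return sent


if __name__ == "__main__":
    # A connected socket pair plays the roles of stream script and Spark job.
    producer, consumer = socket.socketpair()
    count = forward(["I love #Overwatch", "unrelated tweet"],
                    ["#overwatch"], producer)
    producer.close()
    print(count, consumer.recv(4096).decode("utf-8"))
    consumer.close()
```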

Real-time Analysis

  • Use Spark Streaming to perform the real-time analysis on the tweets
  • Count the number of related tweets for each time interval
  • Preprocess the tweet text
    • Remove all punctuation
    • Convert capital letters to lowercase
    • Remove stop words for better performance
  • Find out the most related keywords
  • Find out the most related hashtags
  • Sentiment analysis
    • Use Spark MLlib to build a Naive Bayes model
    • Classify each tweet as positive or negative
    • Training examples from Sanders Analytics

Database

  • Use MongoDB to store the analysis results
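
As a rough sketch of what storing one batch of results might look like, the code below builds a result document and inserts it with pymongo. Everything named here is an assumption for illustration: the field names, the database name "twitter", the collection name "results", and the connection URI are hypothetical, and the project's actual schema is not shown.

```python
from datetime import datetime, timezone


def build_result(hashtag, tweet_count, top_hashtags, top_keywords,
                 positive, negative, when=None):
    """Build one result document for a batch (field names are hypothetical)."""
    total = positive + negative
    return {
        "hashtag": hashtag,
        "timestamp": when or datetime.now(timezone.utc),
        "tweet_count": tweet_count,
        "top_hashtags": [{"tag": t, "count": c} for t, c in top_hashtags],
        "top_keywords": [{"word": w, "count": c} for w, c in top_keywords],
        # None when no tweet was classified in this batch.
        "positive_ratio": positive / total if total else None,
    }


def store_result(doc, uri="mongodb://localhost:27017",
                 db_name="twitter", coll_name="results"):
    """Insert the document into MongoDB and return its generated _id."""
    # Imported here so build_result works without pymongo installed.
    from pymongo import MongoClient

    client = MongoClient(uri)
    try:
        return client[db_name][coll_name].insert_one(doc).inserted_id
    finally:
        client.close()


if __name__ == "__main__":
    doc = build_result("#overwatch", 42, [("#overwatch", 40)],
                       [("hero", 12)], positive=30, negative=10)
    print(doc["positive_ratio"])
```

Calling store_result(doc) requires a running mongod process and the pymongo package.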

Visualization

The dashboard shows a timeline of related tweet counts, the most related hashtags, the most related keywords, and the ratio of positive to negative tweets.

Prerequisite

Resources

License

See the LICENSE file for license rights and limitations (MIT).