Habakkuk

Habakkuk is an application for filtering tweets that contain Christian Bible references. The goal is to capture the book name, chapter number, verse number, and tweet text for further analysis.

Dependencies

This project requires PostgreSQL. The commands below also install Node.js and Karma, which are used for the JavaScript unit tests.

sudo apt-get install postgresql-8.4 postgresql-client-8.4 postgresql-server-dev-8.4
sudo apt-get install build-essential python-dev
sudo add-apt-repository ppa:chris-lea/node.js
sudo apt-get update
sudo apt-get install nodejs
sudo npm install karma

Run the Python tests with coverage:

coverage run --source='.' manage.py test web topic_analysis
coverage report

Web app

This is the frontend code for http://bakkify.com.

Django

This project uses Django. Run the following to set up the virtual environment.

$ virtualenv .
$ . ./bin/activate
$ pip install -r requirements.txt

Angular

This project uses AngularJS, with Karma for JavaScript unit testing. To run the tests:

# install dependencies
karma start

Topic Analysis

The topic analysis is based on the NMF topic extraction example. It performs k-means clustering on velocity features for Bible verses, then applies NMF to extract topics from the text in each cluster. Finally, it uses hierarchical clustering to filter (nearly) duplicate topics and rank the remaining topics.
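The final step, filtering near-duplicate topics, can be illustrated with a minimal standard-library sketch. It is not the project's code (which uses scikit-learn). The Jaccard distance on top terms, the single-linkage merge rule, the 0.5 threshold, and the `(score, terms)` topic shape are all assumptions chosen for illustration.

```python
def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| on two term lists (assumed distance measure)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def single_linkage_clusters(items, distance, threshold):
    """Agglomerative clustering: repeatedly merge the two closest clusters
    until the closest pair is farther apart than `threshold`.
    Returns clusters as lists of indices into `items`."""
    clusters = [[i] for i in range(len(items))]
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Single linkage: cluster distance is the closest member pair.
                d = min(distance(items[i], items[j])
                        for i in clusters[x] for j in clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best[0] > threshold:
            break
        _, x, y = best
        clusters[x].extend(clusters.pop(y))  # y > x, so index x stays valid
    return clusters


def dedupe_and_rank(topics, threshold=0.5):
    """topics: list of (score, terms) pairs (hypothetical shape).
    Keep the best-scoring topic per cluster, then sort by score."""
    term_lists = [terms for _, terms in topics]
    clusters = single_linkage_clusters(term_lists, jaccard_distance, threshold)
    keepers = [max((topics[i] for i in c), key=lambda t: t[0]) for c in clusters]
    return sorted(keepers, key=lambda t: t[0], reverse=True)
```

Given two topics that share most of their top terms, only the higher-scoring one survives, and the survivors come back ordered by score. The pairwise search is quadratic per merge, which is acceptable for a handful of topics per cluster but not for large inputs.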

Real-time processing

This project uses a Storm topology to analyze tweets from the Twitter sample stream. The entry point is a Storm spout that uses twitter4j to access the stream with a username and password. Tweets are then passed to a Storm shell bolt, implemented in Python, that applies a regular expression to detect Christian Bible references. Finally, a bolt receives the tuple tagged with a Bible reference and stores it in Elasticsearch.
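As a rough illustration of the matching step inside the Python bolt, the sketch below pulls book, chapter, and verse out of tweet text. The alias table, the pattern, and the `find_references` name are hypothetical. The project's real expressions live in bible_verse_matching and are not reproduced here.

```python
import re

# Hypothetical, heavily abbreviated alias table (lowercase alias -> book name).
BOOK_ALIASES = {
    "genesis": "Genesis",
    "psalm": "Psalms",
    "psalms": "Psalms",
    "habakkuk": "Habakkuk",
    "hab": "Habakkuk",
    "john": "John",
    "1 john": "1 John",
    "romans": "Romans",
}

# Try longer aliases first so "1 john" wins over "john" and "psalms" over "psalm".
_alternatives = "|".join(
    re.escape(alias) for alias in sorted(BOOK_ALIASES, key=len, reverse=True)
)

REFERENCE_RE = re.compile(
    r"(?<![A-Za-z0-9])"              # do not start inside another word ("rehab")
    r"(?P<book>" + _alternatives + r")"
    r"\.?\s+"                        # optional period after an abbreviation
    r"(?P<chapter>\d{1,3})\s*:\s*(?P<verse>\d{1,3})(?!\d)",
    re.IGNORECASE,
)


def find_references(text):
    """Return one dict per 'Book chapter:verse' reference found in text."""
    results = []
    for match in REFERENCE_RE.finditer(text):
        results.append({
            "book": BOOK_ALIASES[match.group("book").lower()],
            "chapter": int(match.group("chapter")),
            "verse": int(match.group("verse")),
            "text": text,
        })
    return results
```

This sketch only handles the plain `Book chapter:verse` form. Verse ranges such as `3:16-17` are reported by their first verse, and references without a colon are ignored.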

For more information, refer to the Storm concepts wiki. I also have a habakkuk starter page that provides some background.

Data Stores

Elasticsearch

This project uses Elasticsearch as backend storage. Please refer to the Elasticsearch site for details.

Accumulo

I experimented with Apache Accumulo. The code has been disabled, but the bolt is still there if anyone wants to try it. It works fine, but I found Elasticsearch worked better for this project.

Hadoop

Scripts in analysis/ depend on Cloudera Hadoop CDH3.

Sub-Directories

  • java - Storm Application
  • bible_verse_matching - Tools to build and test the Bible reference regular expressions. Also contains dictionary files for Pig and Mahout.
  • elasticsearch - Index templates and tools to query Elasticsearch
  • accumulo - Table initialization scripts
  • config - Configuration files for setting up Storm with supervisord
  • analysis - Pig scripts for data analysis
  • web - web front-end
  • topic_analysis - topic modeling using scikit-learn