# W205 Class Project: Analytic Engine for Movies
Team members: Ravneet Ghuman, Guillaume De Roo, Varadarajan Srinivasan, Qian Yu
## Section 1. Key Files and Descriptions
### Twitter Movie Data
- extract_tweets_restful.py -- Performs a one-time batch pull of movie tweets and queues them to the Kafka messaging system
- extract_tweets_streaming.py -- Streams live tweets and sends them to the Kafka consumer (a minimal producer sketch follows this list)
- Kafka_Consumer.py -- Kafka consumer that reads tweets from both the RESTful and streaming APIs; the JSON responses it receives are analyzed, feature-extracted, cleansed, and loaded into the Postgres database
- create_tweets_table.sql -- SQL script that creates the table structures to persist the cleansed data from Kafka_Consumer
- preprocessing.py -- Data cleansing module that combines several off-the-shelf functions from nltk, re, itertools, inspect, string, etc.
- textfeatures.py -- Combines functions from pandas, gensim, and sklearn to provide vectorization, feature extraction, and sentiment scoring capabilities
- utils.py -- Provides a logging utility
- testapi.py -- Extracts movie names from Hive tables, transforms the names into hashtags, and passes them to the batch and streaming producer scripts
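Both extractor scripts ultimately hand each tweet to Kafka. A minimal sketch of that handoff, assuming kafka-python and the `Twitter` topic used by the console-consumer command in Section 3a (the broker address and tweet fields are placeholders):

```python
import json
from kafka import KafkaProducer

# Placeholder broker address; adjust to your Kafka setup.
producer = KafkaProducer(bootstrap_servers='localhost:9092')

def queue_tweet(tweet):
    """Serialize one tweet (a dict from the Twitter API) onto the 'Twitter' topic."""
    producer.send('Twitter', json.dumps(tweet).encode('utf-8'))

# Example with a trimmed-down tweet payload.
queue_tweet({'id_str': '123', 'text': 'Loved #LaLaLand!'})
producer.flush()
```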
### IMDB Movie Data
- get_new_movies.py -- Queries the IMDB RESTful API with IMDB movie IDs and writes the data to a Postgres table and a .csv file
- search_movie.py -- Utility that queries the IMDB RESTful API with movie titles
### Tableau Dashboard
- Dashboard.twb -- Transforms the databases into a report useful to analysts
## Section 2. Directory Structure
```
/twitter/
    extract_tweets_restful.py
    extract_tweets_streaming.py
    kafka_Consumer.py
    create_tweets_table.sql
/textprocessing/
    preprocessing.py
    textfeatures.py
    utils.py
/rest-api/
    app.py
/imdb/
    get_new_movies.py
    search_movie.py
    store_data_hive.sql
```
## Section 3a. Prerequisites
### PYTHON setup
Download the Python Twitter tools at https://pypi.python.org/pypi/twitter, then build and install:

```
sudo apt-get install python-setuptools   # if needed
python setup.py build
python setup.py install
pip install flask
pip install flask-restful
pip install sqlalchemy
```
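flask and flask-restful back the movie-name API served by /rest-api/app.py. The actual routes live in app.py; the sketch below only illustrates the pattern, with a hypothetical /movies endpoint and a hard-coded payload:

```python
from flask import Flask
from flask_restful import Api, Resource

app = Flask(__name__)
api = Api(app)

class MovieNames(Resource):
    def get(self):
        # Hard-coded for illustration; the real app.py serves names from Hive.
        return {'movies': ['La La Land', 'Moonlight']}

api.add_resource(MovieNames, '/movies')

if __name__ == '__main__':
    app.run(port=5000)
```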
### TWITTER setup
Create an app on https://apps.twitter.com/ and then create auth tokens.
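With the four tokens in hand, the Python Twitter tools can be exercised as in this sketch (the credential values and the search hashtag are placeholders):

```python
from twitter import Twitter, OAuth

# Placeholder credentials generated at https://apps.twitter.com/
auth = OAuth('ACCESS_TOKEN', 'ACCESS_TOKEN_SECRET',
             'CONSUMER_KEY', 'CONSUMER_SECRET')
t = Twitter(auth=auth)

# Search recent tweets for a movie hashtag.
results = t.search.tweets(q='#LaLaLand', count=10)
for status in results['statuses']:
    print(status['text'])
```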
### ZOOKEEPER Setup
Download Zookeeper from http://www.apache.org/dyn/closer.cgi/zookeeper/

```
# start zookeeper
sudo bin/zkServer.sh start
```
### KAFKA Setup
Download Kafka from https://kafka.apache.org/

```
# start kafka
bin/kafka-server-start.sh config/server.properties
```

Start the Kafka console client if you want to verify that messages are being written:

```
bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic Twitter --from-beginning
```
### Data Analytics Set-up & Steps to Address Vader Package Issues
A. Install the following Python packages first (json, re, and codecs ship with the Python standard library and need no pip install):

1. pip install kafka
2. pip install pandas
3. pip install textblob
4. pip install nltk
5. pip install gensim
6. pip install spacy
7. pip install psycopg2
8. pip install django
9. pip install ipython
B. Download the Vader sentiment lexicon (vader_lexicon.txt) from this git page: https://github.com/cjhutto/vaderSentiment/tree/master/vaderSentiment

C. Copy vader_lexicon.txt into the following directory:

```
cd ~/anaconda/lib/python2.7/site-packages/nltk/sentiment
```

If your Anaconda folder is named anaconda2, then:

```
cd ~/anaconda2/lib/python2.7/site-packages/nltk/sentiment
```
D. Edit vader.py in the same directory as follows (a sketch of the edited methods follows these steps):
1. In the class SentimentIntensityAnalyzer(object), at around line 196:
   a. First change the lexicon_file path in the __init__ function to the absolute path of the vader_lexicon.txt file. For example:
      def __init__(self, lexicon_file="/home/w205user/anaconda2/lib/python2.7/site-packages/nltk/sentiment/vader_lexicon.txt")
   b. In the same class, modify one line in the make_lex_dict function:
      for line in self.lexicon_file.split('\n'): should become for line in self.lexicon_file.split('\n')[:-1]:
   c. Save and exit.
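After the edit, the relevant portion of SentimentIntensityAnalyzer should look roughly like this. The sketch is illustrative only: the surrounding code differs between nltk versions, and the way the file contents reach self.lexicon_file may not match your copy exactly.

```python
class SentimentIntensityAnalyzer(object):
    def __init__(self, lexicon_file="/home/w205user/anaconda2/lib/python2.7/"
                                    "site-packages/nltk/sentiment/vader_lexicon.txt"):
        # Assumption: the file contents end up in self.lexicon_file as one
        # string, as implied by the split('\n') in make_lex_dict below.
        with open(lexicon_file) as f:
            self.lexicon_file = f.read()
        self.lexicon = self.make_lex_dict()

    def make_lex_dict(self):
        lex_dict = {}
        # The [:-1] drops the empty string left after the file's trailing
        # newline, which would otherwise break the tab-split below.
        for line in self.lexicon_file.split('\n')[:-1]:
            (word, measure) = line.strip().split('\t')[0:2]
            lex_dict[word] = float(measure)
        return lex_dict
```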
E. Import additional nltk packages:
1. Start a Python shell by typing python on the command line.
2. In the Python shell, run the following commands:

```python
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
```

3. If you get errors about other missing nltk packages, use the same method to download them.
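To confirm the Vader fix took, score a sample sentence; the exact numbers depend on the lexicon version, so treat the output shown here as shape only:

```python
from nltk.sentiment.vader import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("This movie was absolutely fantastic!"))
# Expected shape: {'neg': 0.0, 'neu': ..., 'pos': ..., 'compound': ...}
```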
### TABLEAU Setup
1. Download Tableau Desktop, version 10.0.2, onto your desktop
   a. The trial version may not be able to publish to a server, so use a student license if possible
2. Set up a Tableau Server
   a. Accept the invitation sent to you for https://10az.online.tableau.com/t/w205movieproject/
   b. Create a new Tableau Server (there is a free trial version)
## Section 3b. Steps to Run
### Twitter Streaming and Batch Data Ingestion
1. Start Kafka zookeeper
/$path/kafka_2.11-0.10.1.0/bin/zookeeper-server-start.sh /$path/kafka_2.11-0.10.1.0/config/zookeeper.properties
2. Start kafka server
/$path/kafka_2.11-0.10.1.0/bin/kafka-server-start.sh /$path/kafka_2.11-0.10.1.0/config/server.properties
3. Launch API for movie names
> nohup python app.py &
4. Launch the Kafka producers
Restful Tweets
> nohup python extract_tweets_restful.py &
Streaming Tweets
> nohup python extract_tweets_streaming.py &
5. Define a tweet data table in Postgres
> psql -f create_tweets_table.sql
6. Launch the Kafka consumer (a sketch of its consume-and-load loop follows these steps)
> nohup python kafka_Consumer.py &
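For reference, the core loop of kafka_Consumer.py amounts to something like the sketch below (kafka-python plus psycopg2). The connection settings and column names are placeholders; the real table definition lives in create_tweets_table.sql:

```python
import json

import psycopg2
from kafka import KafkaConsumer

# Placeholder connection settings; match your own Postgres setup.
conn = psycopg2.connect(dbname='postgres', user='w205user')
cur = conn.cursor()

consumer = KafkaConsumer('Twitter', bootstrap_servers='localhost:9092')
for message in consumer:
    tweet = json.loads(message.value)
    # Hypothetical columns; see create_tweets_table.sql for the real schema.
    cur.execute("INSERT INTO tweets (tweet_id, tweet_text) VALUES (%s, %s)",
                (tweet.get('id_str'), tweet.get('text')))
    conn.commit()
```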
### IMDB Data Query

Query the IMDB RESTful API (a query sketch follows these steps):

1. Install IMDbPY
> pip install imdbpy
2. Query movies with the sample program
> python get_new_movies.py
3. Store the data to a Hive table if needed
> hive -f store_data_hive.sql
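Under the hood, get_new_movies.py and search_movie.py boil down to IMDbPY calls like these (the movie ID and title are arbitrary examples):

```python
from imdb import IMDb  # provided by the imdbpy package

ia = IMDb()

# Fetch a movie by its IMDB ID (digits only, without the 'tt' prefix).
movie = ia.get_movie('0133093')
print(movie.get('title'), movie.get('rating'))

# Or search by title, as search_movie.py does.
for result in ia.search_movie('La La Land')[:5]:
    print(result.movieID, result.get('title'))
```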
### Build IMDB SQL Database

Note: this is optional, and only needed if you want an SQL DB of IMDB movies/TV in your own storage; our end-to-end flow did not use this approach.

1. Download the IMDB files from ftp://ftp.fu-berlin.de
2. Install the SQLObject object-relational manager
3. Create a DB in an SQL database tool such as Postgres
4. Install IMDbPY
> pip install imdbpy
5. Build the IMDB relational DB with IMDbPY (this step may take a few hours)
> imdbpy2sql.py -d /IMDB/download_area/ -u 'postgresql://localhost/imdb'
6. Query the IMDB SQL database with the example Python program (see the sketch after these steps)
> python get_data_imdbSQL_sample.py
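If get_data_imdbSQL_sample.py is not at hand, a direct query against the database built by imdbpy2sql.py looks roughly like this; the table and column names follow IMDbPY's SQL schema and should be treated as assumptions to check against your own DB:

```python
import psycopg2

# 'imdb' is the database created in step 3; adjust credentials as needed.
conn = psycopg2.connect("dbname=imdb")
cur = conn.cursor()

# imdbpy2sql.py stores movies in a 'title' table (assumed schema).
cur.execute("SELECT title, production_year FROM title "
            "WHERE title ILIKE %s LIMIT 10", ('%matrix%',))
for row in cur.fetchall():
    print(row)
```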
### Tableau Dashboard

0. See the current Tableau dashboard at https://10az.online.tableau.com/#/site/w205movieproject/workbooks/1159644/views. If a modification of the data sources or display is needed, follow the steps below.
1. Download the Dashboard.twb present on GitHub onto your desktop
2. When it opens, enter the current Postgres database password: "pass"
3. Modify the dashboard, or the data sources in the "Data Source" section
4. Once the modifications are done, go to Server and sign in to your Tableau Server
5. Publish via Server > Publish Workbook (select the option to embed the password)
6. Once it is published, select the workbook and then Data Source; the sources should appear live. Click Edit Connection to make sure the password for the Postgres database has been saved.
## Section 4. Tools and Packages

- Kafka 0.10.1.0 (Scala 2.11 build, per the kafka_2.11-0.10.1.0 paths above)
- Hadoop 2.6
- Anaconda Python 2.7
- Postgres 9.6
- Apache Hive 2.1
- Tableau Desktop 10.0.2
## Section 5. Known Issues and Challenges

We faced several software tool, compatibility, and integration issues:

- When conducting sentiment analysis, we found that the Python 2.7 NLTK package is missing the Vader lexicon score file; it has to be installed manually (see Section 3a).
- When using the Python hiver package to write Hive tables, we hit a TTransportException that prevented the Hive tables from being written.
- The Tableau connection to Postgres adds some delay at launch, and the connection to Hive introduced even more lag (in addition to formatting issues). The database should potentially be converted to a .tde on the Tableau Server directly to improve performance.
- Building an IMDB SQL database took many hours, and the schemas from IMDbPY are very convoluted and difficult to use. As a result, we decided to use the RESTful API.