
Path Finder - A cycling data analytics tool using Apache Hive

MIDS W205 Final Project, Fall 2015

Lei Yang | Nilesh Bhoyar | Tuhin Mahmud

Presentation | Report

Deployment Guide

System Requirements

  1. port 8330 is open for communication
  2. HDFS, Python, Postgres, Hive, and hiveserver2 are installed and properly configured
  3. the following Python libraries are installed:

$ pip install pyhs2
$ pip install psycopg2
$ pip install stravalib
  4. the following directories exist on the system:
  • file:///data
  • hdfs:///user/w205
  5. as w205, check out the repo:

$ git clone git@github.com:leiyang-mids/w205_project.git
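
To confirm the libraries from step 3 installed cleanly, a quick import check can be run before proceeding:

$ python -c "import pyhs2, psycopg2, stravalib; print('libraries OK')"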

Data Ingestion

  1. for the demo, we have retrieved historical data via the Strava API and stored the CSV files on S3. The real-time data retrieval process is documented in the data_retrieval_strava directory (a sketch follows this list).
  2. as w205, under /data_transfer_ingestion, download data from S3 and load the files into HDFS:
$ ./load_data_lake.sh
  3. an example log file is uploaded in the folder to illustrate a successful ingestion process.
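
As a sketch of that real-time retrieval (not the actual script in data_retrieval_strava; the access token and segment ID below are placeholders), segment data can be pulled with stravalib:

from stravalib.client import Client

# authenticate with a Strava API access token (placeholder value)
client = Client(access_token='YOUR_ACCESS_TOKEN')

# fetch one segment's metadata by ID (illustrative segment ID)
segment = client.get_segment(229781)
print(segment.name, segment.distance)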

Data Transfer

  1. as w205, under /data_transfer_ingestion, create a Hive external table for initial exploration:
$ hive -f hive_base_ddl.sql
  2. as w205, under /data_transfer_ingestion, create managed tables for segment metadata, leaderboard data, segment geolocation data, and activity data (a sanity-check sketch follows the commands below):

$ hive -f hive_segment_ddl.sql
$ hive -f hive_leaderboard_ddl.sql
$ hive -f hive_stream_ddl.sql
$ hive -f hive_activity_ddl.sql
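
Once the DDL scripts have run, the tables can be sanity-checked from Python with pyhs2 (a minimal sketch assuming hiveserver2 is running, which is started in the Data Processing section below; host and credentials are placeholders):

import pyhs2

# connect to hiveserver2 (default port 10000; credentials are placeholders)
with pyhs2.connect(host='localhost', port=10000, authMechanism='PLAIN',
                   user='w205', password='w205', database='default') as conn:
    with conn.cursor() as cur:
        # list the tables created by the DDL scripts above
        cur.execute('show tables')
        for row in cur.fetch():
            print(row)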

Data Processing

  1. as w205, under /data_processing, create the database and tables in Postgres for popular segments:
$ python postgres_setup.py
  2. in another bash window, as w205, under the home directory ~, start hiveserver2:

$ cd ~
$ hive --service hiveserver2
  3. as w205, under /data_processing, extract the 30 most popular segments for each category in every state, and store the stream and metadata in Postgres:
$ python job.py
  4. this step takes several minutes; on completion the bash window shows the output below (a sketch of the underlying Hive-to-Postgres pattern follows):

[w205@ip-172-31-8-168 data_processing]$ python job.py
selecting popular segments...
Populating leaderboard for popular segments...
Populating altitude for popular segments...
Job completed!
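
For orientation, the core of this step moves query results out of Hive and into Postgres; a minimal sketch of that pattern (the query, table names, and credentials here are illustrative, not the actual schema used by job.py):

import pyhs2
import psycopg2

# read rows out of Hive (illustrative query)
hive = pyhs2.connect(host='localhost', port=10000, authMechanism='PLAIN',
                     user='w205', password='w205', database='default')
cur = hive.cursor()
cur.execute('select id, name from segment limit 10')
rows = cur.fetch()
hive.close()

# write the rows into Postgres (illustrative database and table names)
pg = psycopg2.connect(database='strava', user='postgres',
                      password='pass', host='localhost', port='5432')
pgcur = pg.cursor()
for r in rows:
    pgcur.execute('insert into popular_segment (id, name) values (%s, %s)',
                  (r[0], r[1]))
pg.commit()
pg.close()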

Data Serving

  1. as w205, in another bash window, under the home directory, create a cgi-bin directory:

$ cd ~
$ mkdir cgi-bin
  2. under /data_serving, copy HQL_SELECT.py and SQL_SELECT.py to the ~/cgi-bin directory and make them executable:

$ cp *_SELECT.py ~/cgi-bin/
$ cd ~/cgi-bin
$ chmod +x *_SELECT.py      
  3. as w205, under the home directory, start the Python CGI service:

$ cd ~
$ python -m CGIHTTPServer 8330
  4. edit main.html and insert your AWS host IP into the line: var host = {host ip}
  5. as w205, under /data_serving, copy the website scripts to the home directory:

$ cp main.html ~/index.html
$ cp *.js ~/
$ cp *.css ~/

  6. the AWS host is now ready to accept query requests (a sketch of a minimal CGI handler follows this list), but note:
  • hiveserver2 can only handle one query at a time; sending a new query before the previous one completes will cause issues
  • JavaScript runs asynchronously, so be cautious when sending AJAX queries and make sure multiple queries (if necessary) are sent sequentially
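
For reference, a CGI handler like HQL_SELECT.py follows this general shape (a minimal sketch, not the actual script; the query, parameter name, and credentials are placeholders):

#!/usr/bin/env python
# minimal CGI sketch: run a Hive query and return the rows as JSON
import cgi
import json
import pyhs2

form = cgi.FieldStorage()
state = form.getvalue('state', 'CA')  # hypothetical query parameter

conn = pyhs2.connect(host='localhost', port=10000, authMechanism='PLAIN',
                     user='w205', password='w205', database='default')
cur = conn.cursor()
# illustrative query; the real scripts build their own HQL
cur.execute("select * from segment where state = '%s' limit 100" % state)
rows = cur.fetch()
conn.close()

# CGI response: headers, blank line, then the JSON body
print('Content-Type: application/json')
print('')
print(json.dumps(rows))

Note that such a script holds the single hiveserver2 connection for the duration of the query, which is why concurrent requests should be avoided, per the notes above.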

Data Visualization

  1. open a browser and enter host_ip:8330 in the address bar to navigate the results:
  • it takes ~2 minutes to initialize the page and populate the dropdowns; the speed here needs improvement
  • filter segments by state and category
  • visualize historical segments with data from Hive, or popular segments with data from Postgres
  • during the visualization process, both the CGI service and hiveserver2 should show query progress, and the browser console also logs the data transfers
  2. d3.js is used for the Voronoi line chart
  3. jquery.dataTables.min.js is used for displaying data in a grid
  4. a heatmap is also provided to visualize the geo-information of the segments