
Path Finder - A cycling data analytics tool using Apache Hive

MIDS W205 Final Project, Fall 2015

Lei Yang | Nilesh Bhoyar | Tuhin Mahmud

Presentation | Report

Deployment Guide

System Requirements

  1. port 8330 is open for communication
  2. HDFS, Python, Postgres, Hive, and hiveserver2 are installed and properly configured
  3. the following Python libraries are installed:

$ pip install pyhs2
$ pip install psycopg2
$ pip install stravalib
  4. the following directories exist on the system:
  • file:///data
  • hdfs:///user/w205
  5. as w205, check out the repo:

$ git clone git@github.com:leiyang-mids/w205_project.git
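
To confirm the libraries from step 3 installed cleanly, a quick import check can be run before proceeding:

$ python -c "import pyhs2, psycopg2, stravalib; print('libraries OK')"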

Data Ingestion

  1. for the demo, we have retrieved historical data via the Strava API and stored the CSV files on S3. The real-time data retrieval process is documented in the data_retrieval_strava directory (a sketch follows this list).
  2. as w205, under /data_transfer_ingestion, download data from S3 and load the files into HDFS:
$ ./load_data_lake.sh
  3. an example log file is uploaded in the folder to illustrate a successful ingestion process.
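
As a sketch of that real-time retrieval (not the actual script in data_retrieval_strava; the access token and segment ID below are placeholders), segment data can be pulled with stravalib:

from stravalib.client import Client

# authenticate with a Strava API access token (placeholder value)
client = Client(access_token='YOUR_ACCESS_TOKEN')

# fetch one segment's metadata by ID (illustrative segment ID)
segment = client.get_segment(229781)
print(segment.name, segment.distance)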

Data Transfer

  1. as w205, under /data_transfer_ingestion, create a Hive external table for initial exploration:
$ hive -f hive_base_ddl.sql
  2. as w205, under /data_transfer_ingestion, create managed tables for segment metadata, leaderboard data, segment geolocation data, and activity data (a sanity-check sketch follows the commands below):

$ hive -f hive_segment_ddl.sql
$ hive -f hive_leaderboard_ddl.sql
$ hive -f hive_stream_ddl.sql
$ hive -f hive_activity_ddl.sql
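
Once the DDL scripts have run, the tables can be sanity-checked from Python with pyhs2 (a minimal sketch assuming hiveserver2 is running, which is started in the Data Processing section below; host and credentials are placeholders):

import pyhs2

# connect to hiveserver2 (default port 10000; credentials are placeholders)
with pyhs2.connect(host='localhost', port=10000, authMechanism='PLAIN',
                   user='w205', password='w205', database='default') as conn:
    with conn.cursor() as cur:
        # list the tables created by the DDL scripts above
        cur.execute('show tables')
        for row in cur.fetch():
            print(row)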

Data Processing

  1. as w205, under /data_processing, create the database and tables in Postgres for popular segments:
$ python postgres_setup.py
  2. in another bash window, as w205, under the home directory ~, start hiveserver2:

$ cd ~
$ hive --service hiveserver2
  3. as w205, under /data_processing, extract the 30 most popular segments for each category in every state, and store the stream and metadata in Postgres:
$ python job.py
  4. this step takes several minutes; on completion the bash window shows the output below (a sketch of the underlying Hive-to-Postgres pattern follows):

[w205@ip-172-31-8-168 data_processing]$ python job.py
selecting popular segments...
Populating leaderboard for popular segments...
Populating altitude for popular segments...
Job completed!
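
For orientation, the core of this step moves query results out of Hive and into Postgres; a minimal sketch of that pattern (the query, table names, and credentials here are illustrative, not the actual schema used by job.py):

import pyhs2
import psycopg2

# read rows out of Hive (illustrative query)
hive = pyhs2.connect(host='localhost', port=10000, authMechanism='PLAIN',
                     user='w205', password='w205', database='default')
cur = hive.cursor()
cur.execute('select id, name from segment limit 10')
rows = cur.fetch()
hive.close()

# write the rows into Postgres (illustrative database and table names)
pg = psycopg2.connect(database='strava', user='postgres',
                      password='pass', host='localhost', port='5432')
pgcur = pg.cursor()
for r in rows:
    pgcur.execute('insert into popular_segment (id, name) values (%s, %s)',
                  (r[0], r[1]))
pg.commit()
pg.close()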

Data Serving

  1. as w205, in another bash window, under the home directory, create a cgi-bin directory:

$ cd ~
$ mkdir cgi-bin
  2. under /data_serving, copy HQL_SELECT.py and SQL_SELECT.py to the ~/cgi-bin directory and make them executable:

$ cp *_SELECT.py ~/cgi-bin/
$ cd ~/cgi-bin
$ chmod +x *_SELECT.py      
  3. as w205, under the home directory, start the Python CGI service:

$ cd ~
$ python -m CGIHTTPServer 8330
  4. edit main.html and insert your AWS host IP into the line: var host = {host ip}
  5. as w205, under /data_serving, copy the website scripts to the home directory:

$ cp main.html ~/index.html
$ cp *.js ~/
$ cp *.css ~/

  6. the AWS host is now ready to accept query requests (a sketch of a minimal CGI handler follows this list), but note:
  • hiveserver2 can only handle one query at a time; sending a new query before the previous one completes will cause issues
  • JavaScript runs asynchronously, so be cautious when sending AJAX queries and make sure multiple queries (if necessary) are sent sequentially
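
For reference, a CGI handler like HQL_SELECT.py follows this general shape (a minimal sketch, not the actual script; the query, parameter name, and credentials are placeholders):

#!/usr/bin/env python
# minimal CGI sketch: run a Hive query and return the rows as JSON
import cgi
import json
import pyhs2

form = cgi.FieldStorage()
state = form.getvalue('state', 'CA')  # hypothetical query parameter

conn = pyhs2.connect(host='localhost', port=10000, authMechanism='PLAIN',
                     user='w205', password='w205', database='default')
cur = conn.cursor()
# illustrative query; the real scripts build their own HQL
cur.execute("select * from segment where state = '%s' limit 100" % state)
rows = cur.fetch()
conn.close()

# CGI response: headers, blank line, then the JSON body
print('Content-Type: application/json')
print('')
print(json.dumps(rows))

Note that such a script holds the single hiveserver2 connection for the duration of the query, which is why concurrent requests should be avoided, per the notes above.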

Data Visualization

  1. open a browser and enter host_ip:8330 in the address bar to navigate the results:
  • it takes ~2 minutes to initialize the page and populate the dropdowns; the speed here needs improvement
  • filter segments by state and category
  • visualize historical segments with data from Hive, or popular segments with data from Postgres
  • during the visualization process, both the CGI service and hiveserver2 should show query progress, and the browser console also logs the data transfers
  2. d3.js is used for the Voronoi line chart
  3. jquery.dataTables.min.js is used for displaying data in a grid
  4. a heatmap is also provided to visualize the geo-information of the segments