for the demo, we retrieved historical data via the Strava API and stored the CSV files on S3. The real-time data retrieval process is documented in the data_retrieval_strava directory.
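the actual retrieval code lives in that directory; as a rough illustration only, a minimal sketch of pulling one segment's metadata from the Strava v3 API and flattening it to CSV might look like the following (the access token, segment id, and column selection are placeholders, not the project's actual code):

import csv
import requests

# assumption: a valid OAuth access token and an example segment id
ACCESS_TOKEN = "YOUR_STRAVA_ACCESS_TOKEN"
SEGMENT_ID = 229781  # hypothetical segment id

# Strava v3 REST endpoint for a single segment
resp = requests.get(
    "https://www.strava.com/api/v3/segments/%d" % SEGMENT_ID,
    headers={"Authorization": "Bearer %s" % ACCESS_TOKEN},
)
resp.raise_for_status()
seg = resp.json()

# flatten a few fields into a CSV row, mirroring the files stored on S3
with open("segment_%d.csv" % SEGMENT_ID, "w") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "name", "state", "distance"])
    writer.writerow([seg["id"], seg["name"], seg.get("state"), seg["distance"]])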
as w205, under /data_transfer_ingestion, download data from S3 and load the files into HDFS:
$ ./load_data_lake.sh
an example log file is included in the folder to illustrate a successful ingestion run.
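load_data_lake.sh wraps the download and the HDFS load; a minimal sketch of the same flow in Python (the bucket name, key, and HDFS path are assumptions, not the script's actual values) could be:

import subprocess
import boto3

# assumption: illustrative bucket, key, and paths; adjust to the real ones
BUCKET = "w205-strava-demo"
KEY = "segments.csv"
LOCAL = "/tmp/segments.csv"
HDFS_DIR = "/user/w205/strava"

# download the CSV from S3
boto3.client("s3").download_file(BUCKET, KEY, LOCAL)

# create the target directory and load the file into HDFS
subprocess.check_call(["hdfs", "dfs", "-mkdir", "-p", HDFS_DIR])
subprocess.check_call(["hdfs", "dfs", "-put", "-f", LOCAL, HDFS_DIR])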
Data Transfer
as w205, under /data_transfer_ingestion, create a Hive external table for initial exploration:
$ hive -f hive_base_ddl.sql
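hive_base_ddl.sql holds the real definitions; as a hedged sketch of the shape of one such external table (the column names and HDFS location are assumptions, not the project's actual schema), run via the hive CLI from Python:

import subprocess

# assumption: illustrative columns and HDFS path, not the actual base DDL
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS segments_raw (
  segment_id STRING,
  name STRING,
  state STRING,
  category STRING,
  distance FLOAT,
  effort_count INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/w205/strava';
"""
subprocess.check_call(["hive", "-e", ddl])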
as w205, under /data_transfer_ingestion, create managed tables for segment metadata, leaderboard data, segment geolocation data, and activity data.
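a minimal sketch of the pattern for one managed table, assuming the hypothetical segments_raw external table from the sketch above, uses CREATE TABLE ... AS SELECT:

import subprocess

# assumption: segments_raw is the external table sketched earlier;
# the real table names and schemas may differ
ddl = """
CREATE TABLE IF NOT EXISTS segment_meta
STORED AS ORC
AS
SELECT segment_id, name, state, category, distance
FROM segments_raw;
"""
subprocess.check_call(["hive", "-e", ddl])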
Data Processing
as w205, under /data_processing, create the database and tables in Postgres for popular segments:
$ python postgres_setup.py
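postgres_setup.py holds the actual DDL; a minimal sketch of the pattern with psycopg2 (the database name, credentials, and columns are assumptions, not the real script) might be:

import psycopg2

# assumption: local Postgres with placeholder credentials
conn = psycopg2.connect(host="localhost", user="postgres", password="pass", dbname="postgres")
conn.autocommit = True  # CREATE DATABASE cannot run inside a transaction
cur = conn.cursor()
cur.execute("CREATE DATABASE strava")
conn.close()

# connect to the new database and create an illustrative table
conn = psycopg2.connect(host="localhost", user="postgres", password="pass", dbname="strava")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS popular_segments (
        segment_id  TEXT PRIMARY KEY,
        state       TEXT,
        category    TEXT,
        rank        INTEGER
    )
""")
conn.commit()
conn.close()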
in another bash window, as w205, under the home directory ~, start hiveserver2:
$ cd ~
$ hive --service hiveserver2
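once hiveserver2 is up, a quick connectivity check from Python (assuming the pyhive package; localhost and the default port 10000 are assumptions) looks like:

from pyhive import hive

# assumption: hiveserver2 on localhost at the default port 10000
conn = hive.connect(host="localhost", port=10000, username="w205")
cur = conn.cursor()
cur.execute("SHOW TABLES")
print(cur.fetchall())
conn.close()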
as w205, under /data_processing, extract the 30 most popular segments for each category in every state, and store the stream and metadata in Postgres:
$ python job.py
this step takes several minutes; the bash window will show:
[w205@ip-172-31-8-168 data_processing]$ python job.py
selecting popular segments...
Populating leaderboard for popular segments...
Populating altitude for popular segments...
Job completed!
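job.py implements the actual selection; the core pattern, sketched here with pyhive and psycopg2 (the table, column, and credential names are assumptions carried over from the sketches above), is roughly:

import psycopg2
from pyhive import hive

# rank segments by popularity per state and category in Hive
# assumption: an effort_count column exists to measure popularity
hive_cur = hive.connect(host="localhost", port=10000, username="w205").cursor()
hive_cur.execute("""
    SELECT segment_id, state, category, rnk FROM (
        SELECT segment_id, state, category,
               ROW_NUMBER() OVER (PARTITION BY state, category
                                  ORDER BY effort_count DESC) AS rnk
        FROM segments_raw
    ) t
    WHERE rnk <= 30
""")
rows = hive_cur.fetchall()

# write the result into the Postgres table created by postgres_setup.py
pg = psycopg2.connect(host="localhost", user="postgres", password="pass", dbname="strava")
cur = pg.cursor()
cur.executemany(
    "INSERT INTO popular_segments (segment_id, state, category, rank) VALUES (%s, %s, %s, %s)",
    rows,
)
pg.commit()
pg.close()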
Data Serving
as w205, in another bash window, under the home directory ~, create a cgi-bin directory:
$ cd ~
$ mkdir cgi-bin
under /data_serving, copy HQL_SELECT.py and SQL_SELECT.py to the ~/cgi-bin directory and make them executable:
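$ cp HQL_SELECT.py SQL_SELECT.py ~/cgi-bin/
$ chmod +x ~/cgi-bin/HQL_SELECT.py ~/cgi-bin/SQL_SELECT.py
the two scripts are CGI endpoints; a minimal sketch of the Postgres-side pattern (the query, credentials, and column names are assumptions, not the actual SQL_SELECT.py) would be:

#!/usr/bin/env python
import json
import psycopg2

# a CGI script prints the response header, then a blank line, then the body
print("Content-Type: application/json")
print("")

# assumption: the strava database and popular_segments table from the sketches above
conn = psycopg2.connect(host="localhost", user="postgres", password="pass", dbname="strava")
cur = conn.cursor()
cur.execute("SELECT segment_id, state, category, rank FROM popular_segments")
print(json.dumps(cur.fetchall()))
conn.close()

this assumes a CGI-capable HTTP server is serving ~/cgi-bin on port 8330, e.g. Python's built-in CGIHTTPServer module started from the home directory.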
the AWS host is now ready to accept query requests, but note:
hiveserver2 can only handle one query at a time; sending a new query before the previous one completes will cause errors
JavaScript runs asynchronously, so be careful when sending AJAX queries and make sure multiple queries (if necessary) are sent sequentially.
Data Visualization
Open a browser and enter host_ip:8330 in the address bar to navigate the results:
it takes ~2 minutes to initialize the page and populate the dropdowns; the speed here needs improvement
filter segments by state and category
visualize historical segments with data from Hive, or popular segments with data from Postgres
during the visualization process, both the CGI script and hiveserver2 should show query progress, and the browser console also logs the data transfer.
d3.js is used for the Voronoi line chart
jquery.dataTables.min.js is used for displaying data in a grid
a heatmap is also provided to visualize the segments' geographic information