Tutorial on creating cloud infrastructure to store and use SportVU movement data. This tutorial was made for Linux users. Because it uses Tableau for visualizations, those steps require Windows.
- Modify the movement/constant.py file for the cloned repo location.
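import os

# Change data_dir to the path of your own cloned repo.
# .get() avoids a KeyError when HOME is unset, so the explicit error below is raised instead.
if os.environ.get('HOME') == '/home/neil':
    data_dir = '/home/neil/projects/nba-movement-hive'
else:
    raise Exception("Unspecified data_dir, unknown environment")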
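As an alternative to editing the check for every machine, the path could be taken from an environment variable. This is only a sketch and not the repo's code: the variable name NBA_MOVEMENT_DATA_DIR and the function resolve_data_dir are hypothetical, and the fallback simply mirrors the per-user check above.

```python
import os


def resolve_data_dir(environ=None):
    """Return the repo location, preferring an explicit override.

    NBA_MOVEMENT_DATA_DIR is a hypothetical variable name, not one the repo defines.
    """
    environ = os.environ if environ is None else environ
    override = environ.get("NBA_MOVEMENT_DATA_DIR")
    if override:
        return override
    # Fall back to the original per-user check from movement/constant.py.
    if environ.get("HOME") == "/home/neil":
        return "/home/neil/projects/nba-movement-hive"
    raise Exception("Unspecified data_dir, unknown environment")
```

With something like this in place, a user would export the variable once in their shell instead of changing the source file.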
- Install the Python package
python setup.py build
sudo python setup.py install
- Extract the data from the data folder
cd data/
sudo ./setup.sh
- Convert the JSON files to the proper CSV files
python movement/json_to_csv.py
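To give a feel for what the conversion involves, here is a minimal sketch of flattening SportVU-style JSON into CSV rows. It is not the repo's json_to_csv.py. It assumes the commonly seen layout in which each event holds a list of moments, and each moment is [quarter, timestamp, game clock, shot clock, unused, positions], with each position being [team id, player id, x, y, radius]. The function names, the column order, and the sample values are all illustrative assumptions.

```python
import csv
import io


def moments_to_rows(game):
    """Flatten one game dict into one row per tracked object per moment."""
    rows = []
    for event in game["events"]:
        for moment in event["moments"]:
            quarter, timestamp_ms, game_clock, shot_clock, _unused, positions = moment
            for team_id, player_id, x, y, radius in positions:
                rows.append([game["gameid"], event["eventId"], quarter, timestamp_ms,
                             game_clock, shot_clock, team_id, player_id, x, y, radius])
    return rows


def rows_to_csv(rows):
    """Serialize rows to CSV text without a header line."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Tiny hand-made sample in the assumed layout; the values are made up, not real tracking data.
sample_game = {
    "gameid": "0000000001",
    "events": [{
        "eventId": "1",
        "moments": [[1, 1000, 720.0, 24.0, None,
                     [[-1, -1, 47.0, 25.0, 5.0],     # ball (team and player id of -1)
                      [10, 100, 40.0, 20.0, 0.0]]]],  # one player
    }],
}
print(rows_to_csv(moments_to_rows(sample_game)))
```

The header line is left out here because Hive would otherwise load it as a data row unless the table is configured to skip it. Whether the repo's script makes the same choice is not stated.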
AWS requires a key pair in order to SSH into the EMR instances, so first create a pem key on the EC2 console.
- Create an S3 bucket on AWS and upload the extracted CSV files to the bucket. Make sure each item in the bucket is public.
- Create a default EMR cluster on m1.medium instances (cheapest available) with one master and 2 core nodes. Wait until the cluster has a waiting status.
- Add an inbound rule to the master security group on the cluster for all connections. See the anywhere rule at the bottom of the below image.
- SSH into the EMR cluster. The EMR cluster should provide the proper command.
ssh -i {pem-key} {ec2-login}
- If you get an error denying access because the key file's permissions are too open, you can restrict them with
chmod 400 {pem-key}
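The permission check can also be scripted. The sketch below mirrors what chmod 400 does, using only the Python standard library. The helper names key_is_too_open and restrict_key are hypothetical, and the demo runs against a throwaway temporary file rather than a real key.

```python
import os
import stat
import tempfile


def key_is_too_open(path):
    """Return True if group or others have any permission on the key file."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return bool(mode & (stat.S_IRWXG | stat.S_IRWXO))


def restrict_key(path):
    """Equivalent of chmod 400: readable by the owner only."""
    os.chmod(path, stat.S_IRUSR)


# Demonstrate on a throwaway file instead of a real pem key.
with tempfile.NamedTemporaryFile(suffix=".pem", delete=False) as handle:
    demo_key = handle.name
os.chmod(demo_key, 0o644)          # world-readable, which ssh would reject
print(key_is_too_open(demo_key))
restrict_key(demo_key)
print(key_is_too_open(demo_key))
os.remove(demo_key)
```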
- Create the tables in EMR once connected to the cluster. Enter the hive tool and paste the tables/create_movement_hive.sql and tables/create_shots_hive.sql scripts to create the tables. Paste the tables/load_data_hive.sql script to load the CSV files onto the cluster.
hive
- Verify the stored data by querying the distinct games in the movement table.
select distinct(game_id) from movement;
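The same check can be run non-interactively with the hive command's -e option, which executes a single statement and exits. The sketch below wraps that in Python. It assumes Python 3.7 or newer is available on the master node and that hive is on the PATH. The function names are hypothetical.

```python
import subprocess


def hive_command(query):
    """Build the argument list for running one HiveQL statement non-interactively."""
    return ["hive", "-e", query]


def distinct_game_ids():
    """Run the verification query and return the non-empty lines of its output."""
    result = subprocess.run(
        hive_command("select distinct(game_id) from movement;"),
        capture_output=True, text=True, check=True,
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    # Only prints the command, so this runs on machines without Hive installed.
    print(hive_command("select distinct(game_id) from movement;"))
```

Calling distinct_game_ids() only makes sense on the cluster itself, after the tables have been created and loaded.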
- Install Tableau Desktop (not public version).
- Install the ODBC drivers for HiveServer2.
- Add Amazon Hive ODBC drivers to the ODBC Data Source Admin tool in Windows.
- Add the DNS URL from the cluster to the configuration for the driver in the ODBC Admin Tool.
- Add the EMR connection in Tableau. Use the same port and username as before. Enable the SSL connection.
- Once the connection is opened, you can add the schema and the tables you want to visualize.
- Once the data connection is set up, you can start playing with the shot data. Check off Analysis > Aggregate Measures.
- Import the img/nba_court.jpg as a background image. Go to Map > Background Images and use the below setting for the half court display.
- Use the Loc X field for the columns and the Loc Y field for the rows.
- Create art.
Note that these operations can take a long time, because the queries run externally on the cluster and their speed depends on the EC2 instance type you chose for the EMR cluster.