
Implementation of a website that tracks fantasy sports leagues.


Fantasy Sports Leagues

About

Fantasy Sports Leagues is my Data Engineering project, built as part of Insight Data Science's Data Engineering fellowship program (session 2015A).

Introduction

For this project, I decided to combine my love of data and sports.

While still focusing on the Data Engineering aspect, I thought it would be interesting to learn about the implications of building a pipeline that updates with real-time events and serves a user base of ~5 million users.

My technology stack includes Kafka, HDFS, Spark, Spark Streaming, and Cassandra; it is described below.

Website

The project is currently hosted at http://4fsports.net

Setup

The pipeline lives on an AWS EC2 cluster.

  • Three instances are dedicated to Cloudera's Hadoop Distribution (CDH5, Cloudera Manager 5.1.4).
  • Three instances are dedicated to the DataStax AMI distribution of Cassandra.
  • One micro-instance is dedicated to hosting the Flask web server, which is developed in a separate repository.

Pipeline

I tried to follow the general guidelines of a Lambda Architecture. Below is an overview of the pipeline I'm using:

[Pipeline overview diagram]

Data Ingestion

There are two types of data in the pipeline:

  • Engineered User Data
  • NFL Play-by-Play Data for the 2014 Season

The engineered data consists of each user's roster (currently only 4 players) and is generated via a Ruby script; a sketch of the idea is shown below.
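
As a rough illustration, here is a minimal Scala sketch of what the generator does (the actual script is written in Ruby; the player pool, user count, and output format here are assumptions):

```scala
import scala.util.Random

object RosterGenerator {
  // Illustrative player pool; the real script draws from actual NFL rosters.
  val players = Vector(
    "Tom Brady", "Aaron Rodgers", "Marshawn Lynch", "Antonio Brown",
    "Rob Gronkowski", "Le'Veon Bell", "Demaryius Thomas", "Andrew Luck"
  )

  // Each user gets a 4-player roster, matching the description above.
  def roster(userId: Int): String =
    s"$userId,${Random.shuffle(players).take(4).mkString(";")}"

  def main(args: Array[String]): Unit =
    (1 to 100).map(roster).foreach(println)
}
```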

Kafka 2.9.2 was installed on the name node, and a topic was created for the play-by-play data. To re-create the real-time flow of the NFL data, a Python script reads in plays from the past season and produces messages that are sent to the Kafka broker; that loop is sketched below.
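
A minimal sketch of that replay loop, written here in Scala with the standard Kafka producer client (the original is a Python script; the broker address, topic name, message format, and file path are assumptions):

```scala
import java.util.Properties
import scala.io.Source
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object PlayByPlayProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "namenode:9092") // assumed broker address
    props.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    // One play per line, e.g. "playerName,points,timestamp" (format assumed).
    for (play <- Source.fromFile("plays_2014.csv").getLines()) {
      producer.send(new ProducerRecord[String, String]("plays", play))
      Thread.sleep(100) // throttle to mimic the pacing of a live game
    }
    producer.close()
  }
}
```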

There are two Kafka receivers: one written in Python, which simply takes the messages and saves them to HDFS, and another inside the Spark Streaming job.

Real-time / Speed Layer

A play consists of a player name, points scored, and a timestamp. Spark Streaming acts as a Kafka receiver; for every record that comes in, it looks up the list of users who have that player on their team and creates a new record for each of those users.

[Speed layer fan-out diagram]
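
A minimal Scala sketch of this fan-out step, assuming Spark 1.x's receiver-based Kafka integration; the message format, addresses, and the playersToUsers lookup are hypothetical:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("SpeedLayer")
val ssc  = new StreamingContext(conf, Seconds(1))

// Hypothetical lookup: player name -> users with that player on their roster.
val playersToUsers: Map[String, Seq[String]] = Map(
  "Tom Brady" -> Seq("user1", "user42")
)

// Receiver-based stream from the "plays" topic (ZooKeeper address assumed).
val plays = KafkaUtils.createStream(ssc, "namenode:2181", "speed-layer",
  Map("plays" -> 1))

val userPoints = plays
  .map { case (_, msg) =>              // msg: "playerName,points,timestamp"
    val Array(player, points, _) = msg.split(",")
    (player, points.toInt)
  }
  .flatMap { case (player, points) =>
    // One new record per user who owns the scoring player.
    playersToUsers.getOrElse(player, Seq.empty).map(user => (user, points))
  }
```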

The generated records are saved to HDFS and also processed further by keeping an aggregate count of points per user. The aggregate count is saved to Cassandra, where it is accessed by the serving layer and added to the historical count produced by the batch queries.
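
Continuing the sketch above, the fan-out records could be persisted and aggregated roughly as follows (the HDFS paths, keyspace, table, and column names are assumptions; the authoritative schema is the serving-layer CQL):

```scala
import com.datastax.spark.connector.SomeColumns
import com.datastax.spark.connector.streaming._

ssc.checkpoint("hdfs://namenode:8020/fantasy/checkpoints") // needed by updateStateByKey

// Raw per-user records go to HDFS for the batch layer to re-process.
userPoints.saveAsTextFiles("hdfs://namenode:8020/fantasy/user_points")

// Keep a running total of points per user across the stream.
val totals = userPoints.updateStateByKey[Int] { (batch: Seq[Int], state: Option[Int]) =>
  Some(state.getOrElse(0) + batch.sum)
}

// Write the running totals to Cassandra for the serving layer.
totals.saveToCassandra("fantasy", "realtime_user_points",
  SomeColumns("user_id", "points"))

ssc.start()
ssc.awaitTermination()
```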

Batch Layer

Spark runs on top of HDFS and executes a few batch queries every 24 hours (the first is sketched after the list):

  • Calculate the top 10 users of all time
  • Calculate each user's points for a play week
  • Calculate each player's points by game
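
For illustration, a minimal Spark sketch of the first query, assuming the speed layer's "userId,points" text records on HDFS (the path and record format are assumptions):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("BatchTop10"))

val top10 = sc.textFile("hdfs://namenode:8020/fantasy/user_points/*")
  .map { line =>
    val Array(user, points) = line.split(",")
    (user, points.toInt)
  }
  .reduceByKey(_ + _)         // total points per user, all time
  .top(10)(Ordering.by(_._2)) // ten highest totals

top10.foreach(println)
```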

Serving Layer

The serving layer consists of a Cassandra cluster and a Flask web server.

The schema for the Cassandra tables was written in CQL and can be found in the serving layer folder.
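
For reference, a table holding the speed layer's running totals might look something like this (purely illustrative; see the serving layer folder for the real schema):

```sql
-- Illustrative only; the authoritative CQL lives in the serving layer folder.
CREATE TABLE fantasy.realtime_user_points (
    user_id text PRIMARY KEY,
    points  int
);
```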

Future Plans

My grand vision is to implement real-time game substitutions: I would enjoy seeing whether users can substitute players in the middle of games and still get the points. Besides the logistical aspect of allowing