Project 3: Understanding User Behavior

  • You're a data scientist at a game development company

  • Your latest mobile game has two events you're interested in tracking: buy a sword and join a guild

  • Each has metadata characteristic of such events (e.g., sword type, guild name)
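
As a rough illustration of what such events might look like, here is a minimal sketch in plain Python. The field names (event_type, sword_type, guild_name), the helper make_event, and the sample values are assumptions made for this example; the assignment does not prescribe a schema.

```python
import json


def make_event(event_type, **metadata):
    """Return a flat event dict: the event type plus its metadata fields."""
    event = {"event_type": event_type}
    event.update(metadata)
    return event


# Hypothetical metadata values, chosen only for illustration.
sword_event = make_event("buy_sword", sword_type="longsword")
guild_event = make_event("join_guild", guild_name="night_watch")

# Events would typically be serialized to JSON before being logged.
print(json.dumps(sword_event, sort_keys=True))
print(json.dumps(guild_event, sort_keys=True))
```

Keeping the event flat (one level of keys) tends to make it easier to map onto columns later, but a nested layout would work too.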

Tasks

  • Instrument your API server to log events to Kafka

  • Assemble a data pipeline to catch these events: use Spark Streaming to filter selected event types from Kafka and land them in HDFS/Parquet so they are available for analysis using Presto.

  • Use Apache Bench to generate test data for your pipeline.

  • Produce an analytics report where you provide a description of your pipeline and some basic analysis of the events. Explaining the pipeline is key for this project!

  • Submit your work as a git PR as usual. AFTER you have received feedback, you must merge the branch yourself and respond to the feedback in a comment. Your grade will not be complete unless this is done!
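
To make the "filter selected event types" step concrete, here is a minimal sketch of the filtering logic in plain Python. It is a stand-in for the idea only, not Spark or Kafka code: the messages are hard-coded strings rather than records read from a topic, and the event type names and the event_type field are assumptions made for this example.

```python
import json

# Event types to keep; these names are assumptions, not fixed by the assignment.
WANTED_EVENT_TYPES = ("buy_sword", "join_guild")


def is_wanted(raw_message):
    """Return True if a raw JSON message is one of the event types we keep."""
    try:
        event = json.loads(raw_message)
    except ValueError:
        return False  # skip messages that are not valid JSON
    return isinstance(event, dict) and event.get("event_type") in WANTED_EVENT_TYPES


# Hard-coded sample messages standing in for what a topic might contain.
raw_messages = [
    '{"event_type": "buy_sword", "sword_type": "longsword"}',
    '{"event_type": "default"}',
    '{"event_type": "join_guild", "guild_name": "night_watch"}',
    "not json",
]

kept = [json.loads(m) for m in raw_messages if is_wanted(m)]
for event in kept:
    print(event["event_type"])
```

In the real pipeline the same predicate would be expressed as a filter on the stream before the events are written out; explaining where that filter sits is part of describing your pipeline.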

Use a notebook to present your queries and findings. Remember that this notebook should be appropriate for presentation to someone else in your business who needs to act on your recommendations.

It's understood that the events in this pipeline are generated, which makes them hard to connect to actual business decisions. However, we'd like students to demonstrate an ability to plumb this pipeline end-to-end, which includes initially generating test data as well as submitting a notebook-based report of at least simple event analytics. That said, the analytics will only be a small part of the notebook. The whole report is the presentation and explanation of your pipeline plus the analysis!

Options

There are plenty of advanced options for this project. Here are some ways to take your project further than just the basics we'll cover in class:

  • Generate and filter more types of events. There are plenty of other things you might capture events for during gameplay

  • Enhance the API to use additional HTTP verbs such as POST or DELETE and to accept parameters for events (e.g., purchase events might accept sword or item type)

  • Connect a user-keyed storage engine such as Redis or Cassandra to Spark so you can track user state during gameplay (e.g., user's inventory or health)
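
For the user state option, the core idea is folding a stream of events into one record per user. Below is a minimal sketch using an in-memory dict as a stand-in for a user-keyed store such as Redis or Cassandra. The user_id field, the event and metadata names, and the state layout are all assumptions made for this example.

```python
from collections import defaultdict


def new_state():
    """Create an empty user-keyed store (an in-memory stand-in for a real one)."""
    return defaultdict(lambda: {"inventory": [], "guild": None})


def apply_event(state, event):
    """Update the state of the user named in the event, in place."""
    user = state[event["user_id"]]
    if event["event_type"] == "buy_sword":
        user["inventory"].append(event["sword_type"])
    elif event["event_type"] == "join_guild":
        user["guild"] = event["guild_name"]
    return state


# Hypothetical events, chosen only for illustration.
events = [
    {"user_id": "u1", "event_type": "buy_sword", "sword_type": "longsword"},
    {"user_id": "u2", "event_type": "join_guild", "guild_name": "night_watch"},
    {"user_id": "u1", "event_type": "buy_sword", "sword_type": "dagger"},
    {"user_id": "u1", "event_type": "join_guild", "guild_name": "night_watch"},
]

state = new_state()
for event in events:
    apply_event(state, event)

print(state["u1"])
```

With a real storage engine, the dict lookup and update would become reads and writes keyed by user, and you would need to decide how to handle events that arrive out of order.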


GitHub Procedures

Important: In w205, please never merge your assignment branch to the master branch.

Using the git command line: clone down the repo, leave the master branch untouched, create an assignment branch, and move to that branch:

  • Open a Linux command line to your virtual machine and be sure you are logged in as jupyter.
  • Create a ~/w205 directory if it does not already exist: mkdir ~/w205
  • Change directory into the ~/w205 directory: cd ~/w205
  • Clone down your repo: git clone <https url for your repo>
  • Change directory into the repo: cd <repo name>
  • Create an assignment branch: git branch assignment
  • Check out the assignment branch: git checkout assignment

The previous steps only need to be done once. Once your clone is on the assignment branch, it will remain on that branch unless you check out another branch.

The project workflow follows this pattern, which may be repeated as many times as needed. In fact, it's best to do this frequently, as it saves your work to GitHub in case your virtual machine becomes corrupted:

  • Make changes to existing files as needed.
  • Add new files as needed.
  • Stage modified files: git add <filename>
  • Commit staged files: git commit -m "<meaningful comment about your changes>"
  • Push the commit on your assignment branch from your clone to GitHub: git push origin assignment

Once you are done, go to the GitHub web interface and create a pull request comparing the assignment branch to the master branch. Add your instructor, and only your instructor, as the reviewer. The date and time stamp of the pull request is considered the submission time for late penalties.

If you decide to make more changes after you have created a pull request, you can simply close the pull request (without merging!), make more changes, stage, commit, push, and create a final pull request when you are done. Note that the date and time stamp of the last pull request will be considered the submission time for late penalties.

Make sure you receive the emails related to your repository! Your project feedback will be given as a comment on the pull request. When you receive the feedback, you can address problems or simply comment that you have read the feedback. AFTER receiving and answering the feedback, merge your PR to master. Your project only counts as complete once this is done.