/spark-hats

Hands-on Advanced Tutorial Session using Spark to analyze CMS data

Big Data with Spark HATS

This Hands-on Advanced Tutorial Session (HATS) is presented by the LPC to demonstrate a CMS analysis using Apache Spark, Spark-ROOT, Histogrammar, and Matplotlib. After an introduction to Spark and the programming paradigm it brings with it, students will learn some basic building blocks, then combine them to perform a basic measurement of the Z-boson mass using CMS data recorded in 2016.
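To give a flavor of the final measurement, here is a minimal sketch of the quantity a Z-boson mass measurement is typically built on: the invariant mass of a muon pair. This is not code from the tutorial notebooks. It is plain Python, the function names (four_vector, invariant_mass) are hypothetical, and it assumes each muon is described by its transverse momentum in GeV, pseudorapidity, and azimuthal angle. The notebooks may use a different decay channel, column layout, or approach.

```python
import math

# Muon mass in GeV (PDG value); an assumption of this sketch, not taken from the tutorial.
MUON_MASS_GEV = 0.1056583745


def four_vector(pt, eta, phi, mass=MUON_MASS_GEV):
    """Convert (pt, eta, phi, mass) into a Cartesian four-vector (E, px, py, pz)."""
    px = pt * math.cos(phi)
    py = pt * math.sin(phi)
    pz = pt * math.sinh(eta)
    energy = math.sqrt(px * px + py * py + pz * pz + mass * mass)
    return energy, px, py, pz


def invariant_mass(mu1, mu2):
    """Invariant mass of two muons, each given as a (pt, eta, phi) tuple."""
    e1, px1, py1, pz1 = four_vector(*mu1)
    e2, px2, py2, pz2 = four_vector(*mu2)
    energy = e1 + e2
    px, py, pz = px1 + px2, py1 + py2, pz1 + pz2
    # m^2 = E^2 - |p|^2; clamp at zero to absorb floating-point rounding.
    mass_squared = energy * energy - (px * px + py * py + pz * pz)
    return math.sqrt(max(mass_squared, 0.0))


# Two back-to-back muons with pt = 45.6 GeV each, both at eta = 0.
example = invariant_mass((45.6, 0.0, 0.0), (45.6, 0.0, math.pi))
print(f"dimuon invariant mass: {example:.2f} GeV")
```

For the back-to-back pair above, the result comes out near 91.2 GeV, which is the region where the Z peak appears. In a Spark-based analysis, a function like this would be applied to every selected muon pair and the results filled into a histogram.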

Getting Started

Students of the HATS will be given access to Vanderbilt's Jupyter instance using their CERN username. The Jupyter instance contains this repository and all necessary software, preconfigured.

Pre-Exercises

It's critical that each student complete the pre-exercises the day before the tutorial, so that any technical or login issues can be cleared up beforehand. To perform the pre-exercises, connect to Jupyter. You will first need to log in to CERN and authorize Jupyter to authenticate (don't worry, CERN doesn't transfer your password, just a secret authentication token).

Once you've given Jupyter permission to authenticate, click "Start My Server" to start your Jupyter instance.

Once your server starts, you'll be placed into the Jupyter file browser. Then, navigate to

spark-hats/notebooks/10-building-blocks.ipynb

to begin the tutorial.

Accessing this Tutorial in Jupyter

Once logged in to Jupyter, navigate to the spark-hats directory and open the notebook named setup-libraries.ipynb.

Built With

  • Jupyter - Interactive Python notebook interface
  • Apache Spark - Fast and general engine for large-scale data processing
  • Spark-ROOT - Scala-based ROOT/IO interface to Spark
  • Histogrammar - Functional histogramming framework, optimized for Spark
  • Matplotlib - Python plotting library

Authors

Acknowledgments

  • The LPC Distinguished Researcher Program (link) - Support for the author
  • Advanced Computing Center for Research and Education (ACCRE) (link) - Host facility and sysadmin support
  • The Diana-HEP project (link) - Interoperability and compatibility libraries
  • Vanderbilt Trans Institutional Program (TIPs) Award (link) - Big Data hardware seed funding