
Source to my "Introduction to Apache Spark using Frameless" talk

This repo contains the sources for both the slides and the Databricks notebooks for my Introduction to Apache Spark using Frameless talk, given at ScalaIO and at Scale by the Bay in 2018.

The slides

The slides are written in Markdown and must be translated to HTML + Reveal.js using Pandoc. The following executables must be present in your shell's PATH to build the slides:

  • pandoc (version 2.3.1 or better)
  • lessc (version 3.0.4 or better), for LESS stylesheet translation
  • git, to check out the Reveal.js repository.

To build the slides, just run ./build.sh. It produces a standalone slides.html file in the top-level directory.

The Databricks notebooks

The notebooks folder contains the three individual notebooks used during the presentation; you'll need all of them. You can import them one at a time or, more simply, download and import the notebooks.dbc file in this directory, which bundles all three.

For information on how to import notebooks into Databricks, including Databricks Community Edition, see https://docs.databricks.com/user-guide/notebooks/notebook-manage.html#import-a-notebook

There are three notebooks:

  • Defs.scala: definitions shared across the other two notebooks, each of which invokes Defs (a sketch of that pattern follows this list)
  • 00-Create-Data-Files.scala: downloads a data file of tweets from early 2018 and parses a Kafka stream of current tweets, producing the data files the presentation needs. Follow the instructions in this notebook to create local copies of the data. (But also see The Data, below.)
  • 01-Presentation.scala: the hands-on notebook portion of the presentation.
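
In Databricks, one notebook typically pulls another's definitions into scope with the %run magic command, which is presumably how Defs is invoked here. A minimal sketch of the pattern; the case class and path below are hypothetical placeholders, not the repo's actual definitions:

```scala
// A cell in Defs.scala: definitions shared with the other notebooks.
// These names are illustrative placeholders.
case class Tweet(id: Long, user: String, text: String)
val tweetsParquetPath = "/mnt/my-bucket/tweets.parquet"

// A cell at the top of 00-Create-Data-Files.scala or 01-Presentation.scala.
// The magic command must sit alone in its own cell:
// %run ./Defs
// Once that cell runs, Tweet and tweetsParquetPath are in scope.
```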

Software Versions

I ran the notebooks in Databricks, with:

  • Spark 2.3
  • Scala 2.11
  • frameless-dataset_2.11-0.7.0
  • frameless-cats_2.11-0.7.0
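
If you'd rather build against the same libraries outside Databricks, the equivalent sbt dependencies would look roughly like this (the Frameless coordinates are the standard org.typelevel ones; the Scala and Spark patch versions are assumptions, since only 2.11 and 2.3 are pinned above):

```scala
// build.sbt (sketch): versions matching the talk's environment.
scalaVersion := "2.11.12"

libraryDependencies ++= Seq(
  // Any Spark 2.3.x release should match; 2.3.0 is assumed here.
  "org.apache.spark" %% "spark-sql"         % "2.3.0" % "provided",
  "org.typelevel"    %% "frameless-dataset" % "0.7.0",
  "org.typelevel"    %% "frameless-cats"    % "0.7.0"
)
```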

The Data

You can use the 00-Create-Data-Files.scala notebook to download and create the data. However, if you'd prefer to use existing data, you can also just grab prebuilt Parquet files (packaged as zip files) from the following locations:

My recommendation:

  1. Download those zip files.
  2. Unzip them.
  3. Upload them to your own S3 bucket.
  4. In a Databricks workspace (such as Databricks Community Edition), mount your S3 bucket to DBFS (see the sketch after this list).
  5. Update the paths (in the Defs.scala notebook) to point to your S3 bucket.
  6. Enjoy.
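
Here's a minimal Scala sketch of steps 4 and 5, using Databricks' dbutils to mount the bucket and Frameless to read one of the resulting Parquet files. The bucket name, mount point, file name, and Tweet schema are all hypothetical; substitute your own values and the actual definitions from Defs.scala:

```scala
// Step 4: mount the S3 bucket to DBFS (run once per workspace).
// Assumes the cluster already has IAM access to the bucket; otherwise,
// supply credentials as described in the Databricks documentation.
dbutils.fs.mount(
  source     = "s3a://my-bucket",      // placeholder bucket
  mountPoint = "/mnt/spark-frameless"  // placeholder mount point
)

// Step 5: point the notebook at the mounted data and read it with Frameless.
import frameless.TypedDataset
import spark.implicits._

case class Tweet(id: Long, user: String, text: String) // hypothetical schema

val tweets: TypedDataset[Tweet] =
  TypedDataset.create(
    spark.read.parquet("/mnt/spark-frameless/tweets.parquet").as[Tweet]
  )
```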

Feel free to drop me an email (bmc@clapper.org) if you need help.