Apache Spark Standalone Cluster on Docker

The project just got its own article on the Towards Data Science blog on Medium!

This project gives you an Apache Spark cluster in standalone mode with a JupyterLab interface built on top of Docker. Learn Apache Spark through its Scala, Python (PySpark) and R (SparkR) API by running the Jupyter notebooks with examples on how to read, process and write data.
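As a taste of what the notebooks cover, here is a minimal PySpark sketch of that read-process-write cycle. It assumes the compose file names the master service spark-master and exposes Spark's default port 7077, and both file paths are purely illustrative:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Connect to the standalone master; "spark-master:7077" is an assumption
# based on a typical docker-compose service name and Spark's default port.
spark = (SparkSession.builder
         .appName("readme-example")
         .master("spark://spark-master:7077")
         .getOrCreate())

# Read a CSV file, aggregate it, and write the result as Parquet
# (both paths are hypothetical).
df = spark.read.csv("data/example.csv", header=True, inferSchema=True)
summary = df.groupBy("category").agg(F.count("*").alias("rows"))
summary.write.mode("overwrite").parquet("data/example-summary.parquet")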


TL;DR

curl -LO https://raw.githubusercontent.com/andre-marcos-perez/spark-standalone-cluster-on-docker/master/docker-compose.yml
docker-compose up
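Once the containers are up, JupyterLab is served at localhost:8888 and the Spark master UI at localhost:8080 (see the cluster overview below).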

Contents

  • Quick Start
  • Tech Stack
  • Docker Hub Metrics
  • Contributing
  • Contributors

Quick Start

Cluster overview

Application             URL             Description
JupyterLab              localhost:8888  Cluster interface with built-in Jupyter notebooks
Apache Spark Master     localhost:8080  Spark Master node
Apache Spark Worker I   localhost:8081  Spark Worker node with 1 core and 512m of memory (default)
Apache Spark Worker II  localhost:8082  Spark Worker node with 1 core and 512m of memory (default)
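To confirm that both workers registered with the master, you can check the default parallelism from a Python notebook. This is a sketch that reuses the spark session from the example above; on a standalone cluster the value typically equals the total worker cores:

# Two default workers with 1 core each give 2 total cores.
print(spark.sparkContext.defaultParallelism)  # expected: 2 with the defaults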

Prerequisites
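Install Docker Engine 1.13.0+ and Docker Compose 1.10.0+ (see the Tech Stack section for the full version matrix).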

Build from Docker Hub

  1. Download the source code or clone the repository;
  2. Edit the docker compose file with your favorite tech stack versions (check the supported versions of each application under Tech Stack);
  3. Build the cluster;
docker-compose up
  4. Run Apache Spark code using the provided Jupyter notebooks with Scala, PySpark and SparkR examples (a minimal smoke test is sketched below);
  5. Stop the cluster by typing ctrl+c.
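Once JupyterLab is up at localhost:8888, a quick way to exercise the cluster from a Python notebook is a trivial distributed job. The sketch assumes a SparkSession named spark, built as in the example above:

# Sum the integers 0..999999 across the workers to confirm the cluster works.
print(spark.range(1_000_000).selectExpr("sum(id) AS total").first())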

Build from your local machine

Note: local builds are currently supported only on Linux distributions.

  1. Download the source code or clone the repository;
  2. Move to the build directory;
cd build
  3. Edit the build.yml file with your favorite tech stack versions;
  4. Match those versions in the docker compose file;
  5. Build the images;
chmod +x build.sh ; ./build.sh
  6. Build the cluster;
docker-compose up
  7. Run Apache Spark code using the provided Jupyter notebooks with Scala, PySpark and SparkR examples;
  8. Stop the cluster by typing ctrl+c.

Tech Stack

  • Infrastructure

    Component       Version
    Docker Engine   1.13.0+
    Docker Compose  1.10.0+
    Python          3.7.3
    Scala           2.12.11
    R               3.5.2

  • Jupyter Kernels

    Component  Version  Provider
    Python     2.1.4    Jupyter
    Scala      0.10.0   Almond
    R          1.1.1    IRkernel

  • Applications

    Component     Version                Docker Tag
    Apache Spark  2.4.0 | 2.4.4 | 3.0.0  <spark-version>-hadoop-2.7
    JupyterLab    2.1.4                  <jupyterlab-version>-spark-<spark-version>

The Apache Spark R API (SparkR) is only supported on version 2.4.4. The full list of supported versions can be found here.
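To verify which versions your running cluster actually ships, you can print them from a Python notebook. A minimal sketch, again assuming a SparkSession named spark:

import platform

# Both calls are standard APIs; the expected values come from the tables above.
print(spark.version)              # e.g. 2.4.4 if you need SparkR support
print(platform.python_version())  # e.g. 3.7.3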

Docker Hub Metrics

Size and download metrics for the JupyterLab, Spark Master and Spark Worker images are published on their Docker Hub pages.

Contributing

We'd love some help. To contribute, please read the contributing guide.

Starring us on GitHub is also an awesome way to show your support!

Contributors