spark-standalone-cluster-on-docker

Learn Apache Spark in Scala, Python (PySpark) and R (SparkR) by building your own cluster with a JupyterLab interface on Docker. :zap:

Apache Spark Cluster supporting Patient/Donor stem cell matching

This project builds an Apache Spark cluster in standalone mode, with a JupyterLab interface, on top of Docker. Learn Apache Spark through its Scala, Python (PySpark) and R (SparkR) APIs by running the Jupyter notebooks, which include examples of how to read, process and write data. This work is based on (and forked from) the original project by André Perez - dekoperez - andre.marcos.perez@gmail.com.

My expansion of this work applies Apache Spark/GraphX and Hadoop to impute N-locus haplotype graphs, where edges are generated from public haplotype frequencies and connect nodes representing sub-haplotype components. This will be used to provide a scalable patient/donor matching solution for bone marrow transplants.
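
As a rough, hypothetical illustration of that idea (not the project's actual data model), the Scala/GraphX sketch below builds a tiny graph in which vertices stand for sub-haplotype components and edge weights stand in for public haplotype frequencies; all labels and numbers are placeholders.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.graphx.{Edge, Graph, VertexId}

// Hypothetical sketch only: vertex labels and frequencies are made up for illustration.
val spark = SparkSession.builder.appName("haplotype-graph-sketch").getOrCreate()
val sc = spark.sparkContext

// Vertices: sub-haplotype components (placeholder HLA-style labels).
val components = sc.parallelize(Seq[(VertexId, String)](
  (1L, "A*01:01~B*08:01"),
  (2L, "B*08:01~C*07:01"),
  (3L, "A*02:01~B*07:02")
))

// Edges: weights stand in for public haplotype frequencies joining two components.
val links = sc.parallelize(Seq(
  Edge(1L, 2L, 0.062),
  Edge(3L, 2L, 0.017)
))

val graph = Graph(components, links)

// Example query: sum the frequency mass touching each component.
graph.aggregateMessages[Double](
  ctx => { ctx.sendToSrc(ctx.attr); ctx.sendToDst(ctx.attr) },
  _ + _
).collect().foreach(println)
```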

TL;DR

curl -LO https://raw.githubusercontent.com/kunau/spark-standalone-cluster-on-docker/master/docker-compose.yml
docker-compose up

Contents

  • Quick Start
  • Tech Stack
  • Contributing
  • Contributors

Quick Start

Cluster overview

| Application | URL | Description |
| --- | --- | --- |
| JupyterLab | localhost:8888 | Cluster interface with built-in Jupyter notebooks |
| Apache Spark Master | localhost:8080 | Spark Master node |
| Apache Spark Worker I | localhost:8081 | Spark Worker node with 1 core and 512m of memory (default) |
| Apache Spark Worker II | localhost:8082 | Spark Worker node with 1 core and 512m of memory (default) |
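
For orientation, a Scala cell run from the JupyterLab container typically points its SparkSession at the master; the spark://spark-master:7077 URL below assumes the service name used in the compose file and may differ in your setup, so treat this as a minimal sketch rather than a prescribed configuration.

```scala
import org.apache.spark.sql.SparkSession

// Assumes the master is reachable inside the compose network as "spark-master";
// adjust the URL if your docker-compose.yml names the service differently.
val spark = SparkSession.builder
  .appName("cluster-smoke-test")
  .master("spark://spark-master:7077")
  .getOrCreate()

// Quick check that the workers pick up tasks.
println(spark.range(0, 1000).count())
```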

Prerequisites

Build from Docker Hub

  1. Download the source code or clone the repository;
  2. Edit the docker compose file with your favorite tech stack versions; check the supported versions of each app;
  3. Build the cluster;
docker-compose up
  4. Run Apache Spark code using the provided Jupyter notebooks with Scala, PySpark and SparkR examples (a minimal sketch follows this list);
  5. Stop the cluster by typing ctrl+c.
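
As a hedged sketch of what such a notebook cell might look like, assuming a SparkSession named spark is already available; the file paths and column names below are placeholders, not the bundled notebooks' actual datasets.

```scala
import spark.implicits._

// Illustrative read/process/write cell; file paths and columns are placeholders.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data/example.csv")

val counts = df
  .filter($"value" > 0)   // hypothetical numeric column
  .groupBy($"category")   // hypothetical grouping column
  .count()

counts.write.mode("overwrite").parquet("data/example-counts.parquet")
```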

Build from your local machine

Note: Local builds are currently only supported on Linux distributions.

  1. Download the source code or clone the repository;
  2. Move to the build directory;
cd build
  3. Edit the build.yml file with your favorite tech stack version;
  4. Match those versions in the docker compose file;
  5. Build the images;
chmod +x build.sh ; ./build.sh
  6. Build the cluster;
docker-compose up
  7. Run Apache Spark code using the provided Jupyter notebooks with Scala, PySpark and SparkR examples;
  8. Stop the cluster by typing ctrl+c.

Tech Stack

  • Infrastructure

| Component | Version |
| --- | --- |
| Docker Engine | 1.13.0+ |
| Docker Compose | 1.10.0+ |
| Python | 3.7.3 |
| Scala | 2.12.11 |
| R | 3.5.2 |

  • Jupyter Kernels

| Component | Version | Provider |
| --- | --- | --- |
| Python | 2.1.4 | Jupyter |
| Scala | 0.10.0 | Almond |
| R | 1.1.1 | IRkernel |

  • Applications

| Component | Version | Docker Tag |
| --- | --- | --- |
| Apache Spark | 2.4.0 \| 2.4.4 \| 3.0.0 | <spark-version>-hadoop-2.7 |
| JupyterLab | 2.1.4 | <jupyterlab-version>-spark-<spark-version> |

The Apache Spark R API (SparkR) is only supported on version 2.4.4. The full list can be found here.

Contributing

We'd love some help. To contribute, please read this file.

Starring us on GitHub is also an awesome way to show your support ⭐

Contributors