Notice

This is a modified version that uses spark-2.11-hadoop2.7, so Hive support is untested. Because no spark-notebook image is available for spark-2.11-hadoop2.7, a modified Dockerfile for spark-notebook is included.

How to use HDFS/Spark Workbench

To start an HDFS/Spark Workbench:

    docker-compose up -d

To scale up spark-workers:

    docker-compose scale spark-worker=3
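
To confirm that the extra workers actually started, the scale command can be followed by docker-compose ps, which lists the containers of the current compose project. The snippet below is a minimal sketch under that assumption; scale_workers is an illustrative name and not something this repo provides.

```shell
#!/bin/sh
# Hypothetical wrapper: scale_workers is an illustrative name, not part of this repo.
# It rejects anything that is not a non-negative integer before touching Docker.
scale_workers() {
    case "$1" in
        ''|*[!0-9]*)
            echo "usage: scale_workers <count>" >&2
            return 2
            ;;
    esac
    # Same command as above, then list the project's containers
    docker-compose scale spark-worker="$1" && docker-compose ps
}
```

Run it from the directory that contains docker-compose.yml, for example: scale_workers 3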

Interfaces

Namenode: http://localhost:50070
Datanode: http://localhost:50075
Spark-master: http://localhost:8080
Spark-notebook: http://localhost:9001
Hue (HDFS Filebrowser): http://localhost:8088/home
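
As a quick way to check that the interfaces listed above are reachable, each URL can be probed from the Docker host. This is a minimal sketch, assuming curl is installed and the ports are published as listed; WORKBENCH_HOST, interface_url and check_interfaces are illustrative names, not part of this repo.

```shell
#!/bin/sh
# Hypothetical helper: all names below are illustrative, not shipped with this repo.
# Override WORKBENCH_HOST when Docker does not run on localhost.
WORKBENCH_HOST="${WORKBENCH_HOST:-localhost}"

# name=port[/path] entries copied from the interface list above
INTERFACES="namenode=50070 datanode=50075 spark-master=8080 spark-notebook=9001 hue=8088/home"

# Build the URL for one name=port[/path] entry
interface_url() {
    printf 'http://%s:%s\n' "$WORKBENCH_HOST" "${1#*=}"
}

# Probe every interface and print the HTTP status code that curl reports
check_interfaces() {
    for entry in $INTERFACES; do
        url=$(interface_url "$entry")
        code=$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' "$url")
        printf '%-15s %s -> %s\n' "${entry%%=*}" "$url" "$code"
    done
}
```

Calling check_interfaces after docker-compose up -d prints one line per interface. A status of 000 means curl could not connect at all.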

Important

When opening Hue, you might encounter the error "NoReverseMatch: u'about' is not a registered namespace" after login. I disabled the 'about' page (which is the default one) because it caused the Docker container to hang. If you hit this error, append /home to your URI to access Hue: http://docker-host-ip:8088/home

Docs

The motivation behind the repo and an example usage are described on the @BDE2020 Blog.

Count Example for Spark Notebooks

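    import org.apache.spark.sql.SparkSession

    // Reuse the existing session if there is one, otherwise create it
    val spark = SparkSession
      .builder()
      .appName("Simple Count Example")
      .getOrCreate()

    // Read /data.csv as a Dataset[String], one element per line
    val tf = spark.read.textFile("/data.csv")
    tf.count() // number of lines in the file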
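
The count example reads /data.csv, so that file has to exist before the notebook cell runs. One possible way to put a local file into HDFS is sketched below. It assumes that the namenode container has the hdfs command on its PATH and that you pass the real compose service name from docker-compose.yml; upload_to_hdfs is an illustrative name, not part of this repo.

```shell
#!/bin/sh
# Hypothetical helper: upload_to_hdfs is an illustrative name. The service name
# given as $1 is an assumption; look up the real one in docker-compose.yml.
upload_to_hdfs() {
    service="$1"
    file="$2"
    if [ ! -f "$file" ]; then
        echo "no such file: $file" >&2
        return 1
    fi
    # "hdfs dfs -put - <dst>" reads the content from stdin;
    # -T disables the pseudo-TTY so the redirect reaches the container.
    docker-compose exec -T "$service" hdfs dfs -put - "/$(basename "$file")" < "$file"
}
```

For example, upload_to_hdfs namenode ./data.csv would write the file to /data.csv in HDFS, provided the service really is called namenode.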

Maintainer

Ivan Ermilov @earthquakesan