Purpose

This Docker container is intended for learning PySpark programming. It includes the following components.

  • Hadoop v3.2.1
  • Spark v2.4.4
  • Conda 3 with Python v3.7

After running the container, you may visit the following pages.

  • HDFS
  • YARN
  • Spark
  • Spark History
  • Jupyter Lab
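
Once Jupyter Lab is open, a first notebook cell along these lines can confirm that PySpark works. This is a minimal sketch, not part of the container itself: the function name `run_smoke_test`, the application name, and the `local[*]` master are assumptions chosen for illustration, and the container may be configured to submit jobs to YARN instead.

```python
def run_smoke_test():
    """Start a local Spark session, run a tiny aggregation, and return
    the Spark version together with the per-key sums."""
    # Imported inside the function so that defining it does not fail
    # on a machine where PySpark is not installed.
    from pyspark.sql import SparkSession

    # Assumption: "local[*]" runs Spark inside the current process;
    # the container may expect a different master (for example YARN).
    spark = (
        SparkSession.builder
        .master("local[*]")
        .appName("smoke-test")  # hypothetical application name
        .getOrCreate()
    )
    try:
        df = spark.createDataFrame(
            [("a", 1), ("b", 2), ("a", 3)],
            ["key", "value"],
        )
        # groupBy(...).sum("value") names its result column "sum(value)".
        rows = df.groupBy("key").sum("value").collect()
        totals = {row["key"]: row["sum(value)"] for row in rows}
        return spark.version, totals
    finally:
        # Always release the session, even if the job fails.
        spark.stop()
```

Calling `run_smoke_test()` in a notebook cell should return the Spark version string and a dictionary of sums per key. If the call raises an error, the Spark installation or its Java runtime is the first thing to check.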

To start the Docker container, run the following command.

bash ./start-docker-container.sh
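
After the script finishes, one quick way to confirm that the web pages listed above are reachable is to probe their TCP ports. The sketch below uses only the Python standard library. The port numbers are the upstream defaults for each service and are an assumption here; the ports this container actually publishes on the host may differ, so adjust `ASSUMED_PORTS` to match. The names `ASSUMED_PORTS`, `is_port_open`, and `check_services` are hypothetical helpers, not part of this project.

```python
import socket

# Assumption: upstream default web UI ports. The host ports that this
# container publishes may be different.
ASSUMED_PORTS = {
    "HDFS": 9870,            # Hadoop 3.x NameNode web UI default
    "YARN": 8088,            # ResourceManager web UI default
    "Spark": 8080,           # Spark standalone master web UI default
    "Spark History": 18080,  # Spark History Server default
    "Jupyter Lab": 8888,     # Jupyter default
}


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_services(host="localhost", ports=ASSUMED_PORTS):
    """Map each service name to whether its port accepts connections."""
    return {name: is_port_open(host, port) for name, port in ports.items()}


if __name__ == "__main__":
    for name, up in check_services().items():
        status = "reachable" if up else "not reachable"
        print(f"{name:<14} {status}")
```

An open port only shows that something is listening. It does not prove the service behind it is healthy, so opening the page in a browser remains the real check.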

Click the links below to access each portal.

  • Name Node
  • Hadoop Cluster
  • Spark Master
  • History Server
  • Jupyter Lab
  • Hadoop Data Node
  • Airflow Image
  • Spark Worker Node
  • Airflow Scheduler