docker-hadoop-spark-python

Docker image for running Hadoop + Spark + Python

Steps to create a Spark cluster

Create a network:

docker network create spark_network
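
To confirm the network was created (all of the commands below assume the name spark_network):

docker network ls --filter name=spark_network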

Build the image with your own tag:

docker build -t merolhack/spark:latest .
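
The tag is arbitrary; if the build succeeds, the image is listed under whatever tag you chose:

docker images merolhack/spark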

Start the master container and open a shell inside it:

docker run --rm -it --name spark-master --hostname spark-master \
    -p 7077:7077 -p 8080:8080 --network spark_network \
    merolhack/spark:latest /bin/sh

Inside the shell, run the following command to start a Spark Master:

/spark/bin/spark-class org.apache.spark.deploy.master.Master --ip `hostname` --port 7077 --webui-port 8080
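
Once the master is running, its web UI is reachable on the host through the published port 8080. As a quick check from another terminal on the host (assuming nothing else is bound to that port):

curl -sf http://localhost:8080 -o /dev/null && echo "Spark master UI is up"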

Start a worker container and open a shell inside it:

docker run --rm -it --name spark-worker --hostname spark-worker \
    --network spark_network \
    merolhack/spark:latest /bin/sh

Inside the shell, start the Spark Worker and point it at the master:

/spark/bin/spark-class org.apache.spark.deploy.worker.Worker \
    --webui-port 8080 spark://spark-master:7077

To start another Spark Worker, launch a second container on the same network (then start the worker process inside it, as shown below):

docker run --rm -it --network spark_network \
    merolhack/spark:latest /bin/sh
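
Inside that shell, start the worker process the same way as before:

/spark/bin/spark-class org.apache.spark.deploy.worker.Worker \
    --webui-port 8080 spark://spark-master:7077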

Run the SparkPi example; the trailing argument (1000) is the number of partitions the computation is split into, not the number of runs:

/spark/bin/spark-submit --master spark://spark-master:7077 --class \
    org.apache.spark.examples.SparkPi \
    /spark/examples/jars/spark-examples_2.11-2.4.4.jar 1000
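
If the job completes successfully, the driver output should include an estimate of Pi, i.e. a line similar to:

Pi is roughly 3.14159...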

Run with docker-compose
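
These commands assume a docker-compose.yml at the root of the repository. A minimal sketch of what such a file could look like (the service names spark-master and spark-worker match the commands below; everything else is illustrative and taken from the standalone steps above):

version: "3"
services:
  spark-master:
    image: merolhack/spark:latest
    hostname: spark-master
    command: /spark/bin/spark-class org.apache.spark.deploy.master.Master --ip spark-master --port 7077 --webui-port 8080
    ports:
      - "7077:7077"
      - "8080:8080"
  spark-worker:
    image: merolhack/spark:latest
    command: /spark/bin/spark-class org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://spark-master:7077
    depends_on:
      - spark-master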

Run the containers in detached mode:

docker-compose up -d

See logs:

docker-compose logs -f

Scale the cluster up to 3 workers:

docker-compose up --scale spark-worker=3
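
The scale flag can be combined with detached mode to keep the logs out of the foreground:

docker-compose up -d --scale spark-worker=3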

Execute the SparkPi example with 1000 partitions inside the master container:

docker exec spark-master bin/spark-submit \
    --master spark://spark-master:7077 \
    --class org.apache.spark.examples.SparkPi examples/jars/spark-examples_2.11-2.4.4.jar 1000

Docker cleanup

Remove all unused containers:

docker container prune
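
When you are finished with the cluster, the network and the image can be removed as well (assuming nothing else uses them):

docker network rm spark_network
docker image rm merolhack/spark:latest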