Lightweight Docker image for Apache Spark based on Alpine Linux.

Docker Spark

Apache Spark is a framework for distributed big data processing. This project contains the files needed to build a Docker image for Spark. It is a fork of semantive/spark, modified to use Alpine Linux as the base so that the final image is smaller. It can be used in a standalone cluster or, with the accompanying docker-compose.yml, as a base for more complex recipes.

Simple example

To run SparkPi, run the image with Docker:

docker run --rm -it -p 4040:4040 aa8y/spark bin/run-example SparkPi 10

Cluster example [docker-compose]

To create a simple standalone cluster with docker-compose use:

docker-compose up

The Spark UI will be available at http://${YOUR_DOCKER_HOST}:8080 with one worker listed, and Spark jobs can be submitted using the master URL spark://${YOUR_DOCKER_HOST}:7077. To connect to the cluster with spark-shell, use:

spark-shell --master spark://localhost:7077
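
The two endpoints above differ only in scheme and port, so a script can derive both from a single host value. The following is a minimal sketch, not part of this project: the helper names `spark_master_url` and `spark_ui_url` are hypothetical, and the fallback to localhost when no host is given is an assumption.

```shell
# Hypothetical helpers (not part of this repository).
# Assumption: the host defaults to localhost when no argument is passed.

# Print the standalone master URL (port 7077, as described above).
spark_master_url() {
  printf 'spark://%s:7077\n' "${1:-localhost}"
}

# Print the web UI URL (port 8080, as described above).
spark_ui_url() {
  printf 'http://%s:8080\n' "${1:-localhost}"
}
```

With these defined, `spark-shell --master "$(spark_master_url "$YOUR_DOCKER_HOST")"` should connect to the cluster on a remote Docker host, and it falls back to localhost when the variable is unset or empty.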

Tags

There is a tag for each Spark version from 1.6.0 to 2.3.2. The tags 1 and 1.6 point to 1.6.3, the latest release in that major/minor line. Similarly, 2 and 2.3 point to 2.3.2. latest always points to the latest Spark release, which is currently 2.3.2.
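
Docker Hub resolves these alias tags itself, but a script that wants to record the exact version it is running may find it useful to spell the mapping out. The sketch below encodes only the aliases named above; `resolve_tag` is a hypothetical helper, and passing any other tag through unchanged is an assumption.

```shell
# Hypothetical helper: map an alias tag to the full version it points to.
# Only the aliases listed in this README are encoded; the mapping must be
# updated by hand whenever the tags move to a newer release.
resolve_tag() {
  case "$1" in
    1|1.6)        echo "1.6.3" ;;
    2|2.3|latest) echo "2.3.2" ;;
    # Assumption: anything else is already a full version (or an edge tag).
    *)            echo "$1" ;;
  esac
}
```

For example, `docker run --rm -it aa8y/spark:"$(resolve_tag 2.3)" bin/run-example SparkPi 10` would pin the run to 2.3.2 rather than to whatever 2.3 points to later.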

Edge tags have also been added. These images build Spark from source and should contain the most recent code. The motivation was to make it easier to test and use newer Spark code that has not yet been released. There are tags from edge-1.6 to edge-2.2, one for each minor Spark release, and each corresponds to its respective branch on GitHub. edge points to the master branch. However, these images are very bloated, and it is hard to know when to trigger their builds manually because we don't know when the apache/spark repository changes. Some work still needs to be done to keep them up to date.
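
The relationship between edge tags and branches can be sketched as a small lookup. `edge_branch` is a hypothetical helper, and the `branch-X.Y` naming is an assumption based on how the apache/spark repository names its release branches; this project's manifest.yml remains the authoritative source.

```shell
# Hypothetical helper: print the apache/spark branch an edge tag is built from.
# Assumption: release branches are named branch-X.Y in apache/spark.
edge_branch() {
  case "$1" in
    edge)   echo "master" ;;
    # Strip the "edge-" prefix and prepend "branch-".
    edge-*) echo "branch-${1#edge-}" ;;
    *)      echo "not an edge tag: $1" >&2; return 1 ;;
  esac
}
```

Calling `edge_branch edge-2.2` prints the branch name for that tag, while a stable tag such as 2.3.2 is rejected with a non-zero exit status.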

You can always refer to the manifest.yml file for more information about the images being built (see below).

Building / Pushing / Tagging docker images

Docker Helper has been deprecated. The project now uses Dave, which is vastly superior. TravisCI builds, tests, and pushes the stable tags. To build additional tags, fork the repository, enable TravisCI for your fork, edit the manifest.yml file, and follow the instructions in Dave.

Dockerfiles

This is for the benefit of Docker Hub where you cannot host multiple Dockerfiles.

Future Work

License

Apache Licence