docker-spark

Dockerfile for running Apache Spark on Ubuntu

License: Apache License 2.0

Supported tags and respective Dockerfile links

What is Spark?

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala and Python, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

https://spark.apache.org/docs/latest/

What is Docker?

Docker is an open platform for developers and sysadmins to build, ship, and run distributed applications. Consisting of Docker Engine, a portable, lightweight runtime and packaging tool, and Docker Hub, a cloud service for sharing applications and automating workflows, Docker enables apps to be quickly assembled from components and eliminates the friction between development, QA, and production environments. As a result, IT can ship faster and run the same app, unchanged, on laptops, data center VMs, and any cloud.

https://www.docker.com/whatisdocker/

What is a Docker Image?

Docker images are the basis of containers. Images are read-only, while containers are writeable. Only the containers can be executed by the operating system.

https://docs.docker.com/terms/image/

Dependencies

Base Docker image

Branch             Base Image            Description
master             gelog/java:openjdk7   Spark pre-built for Hadoop
spark-for-hadoop   gelog/java:openjdk7   Spark pre-built for Hadoop (dev branch)
spark-from-source  scala:2.10.4          Spark built from source

Note: the spark-from-source image currently takes quite a while to build and produces an image with a virtual size of 2.3 GB.

The recommended branch for general use is master.

How to use this image?

Spark Master

docker run -d --name spark-master -h spark-master -p 8080:8080 -p 7077:7077 \
gelog/spark:1.2-bin-hadoop2.3 spark-class org.apache.spark.deploy.master.Master

Spark Worker

docker run -d --name spark-worker1 -h spark-worker1 --link=hdfs-namenode:hdfs-namenode --link=spark-master:spark-master \
gelog/spark:1.2-bin-hadoop2.3 spark-class org.apache.spark.deploy.worker.Worker spark://spark-master:7077
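
To start several workers, the worker command above can be generated in a loop. The following is a minimal sketch, not part of this repository: the helper name worker_cmd and the numbering scheme (spark-worker1, spark-worker2, ...) are assumptions made for illustration. It assumes, as the worker example does, that containers named spark-master and hdfs-namenode are already running. The script only prints the commands, so they can be reviewed before anything is started.

```shell
#!/bin/sh
# Image tag taken from the examples above.
IMAGE="gelog/spark:1.2-bin-hadoop2.3"

# Hypothetical helper: print the docker run command for worker number $1.
# Container and host names follow the spark-worker1 pattern used above.
worker_cmd() {
    n="$1"
    echo "docker run -d --name spark-worker$n -h spark-worker$n" \
         "--link=hdfs-namenode:hdfs-namenode" \
         "--link=spark-master:spark-master" \
         "$IMAGE spark-class org.apache.spark.deploy.worker.Worker" \
         "spark://spark-master:7077"
}

# Print the commands for three workers (nothing is executed here).
for i in 1 2 3; do
    worker_cmd "$i"
done
```

Piping the script's output to sh would run the printed commands. Each worker registers with the master at spark://spark-master:7077, the same URL used in the single-worker example.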