
docker image of spark for k8s

Primary language: Shell. License: Apache License 2.0.

Spark on Kubernetes

A Kubernetes deployment of a stand-alone Apache Spark cluster.

With this setup, the Spark UI is reachable through kubectl (proxy and port-forward), with no new load balancers and no new external ports.

I created these files and this how-to to control the Spark version and the extra modules deployed on the workers.

Developed and tested on Azure ACS deployed via acs-engine. (WARNING: I've broken the gs:// support for now)

I presume you have your own working kubectl environment.

Sources

START HERE

$ kubectl create -f spark-master-controller.yaml
replicationcontroller "spark-master-controller" created
$ kubectl create -f spark-master-service.yaml
service "spark-master" created
$ kubectl create -f spark-worker-controller.yaml
replicationcontroller "spark-worker-controller" created

Done.
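
Before moving on, it can help to confirm that the master and worker pods actually came up. The helper below is a minimal sketch, not part of this repo: the function name pods_all_running is made up, and it assumes the default column layout of `kubectl get pods` (NAME, READY, STATUS, RESTARTS, AGE) in the current namespace.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with this repo).
# Returns 0 only if kubectl lists at least one pod and every listed pod
# has "Running" in the STATUS column (third column of the default output).
pods_all_running() {
  kubectl get pods --no-headers \
    | awk '$3 != "Running" { bad = 1 } END { exit ((NR == 0 || bad) ? 1 : 0) }'
}

# Example (not run here):
#   pods_all_running && echo "cluster is up"
```

It checks every pod in the namespace, so unrelated pods that are still starting will also make it return non-zero.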

... Unless you want access to the UIs. In that case:

$ kubectl create -f spark-ui-proxy-controller.yaml
replicationcontroller "spark-ui-proxy-controller" created
$ kubectl create -f zeppelin-controller.yaml
replicationcontroller "zeppelin-controller" created

Done.

... Unless you also want to actually use the UIs. In that case:

  • for the kubernetes ui

    kubectl proxy --port=8001

    and follow link: kube ui

  • for the spark ui

    kubectl port-forward spark-ui-proxy-controller-<POD-ID> 8080:80

    and follow link: spark master ui

  • for the zeppelin ui

    kubectl port-forward zeppelin-controller-<POD-ID> 8081:8080

    and follow link: zeppelin ui
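
The port-forward commands above need the generated pod suffix. One way to avoid copying it by hand is a small lookup helper. This is only a sketch: first_pod is a made-up name, and it assumes pod names follow the <controller-name>-<POD-ID> pattern shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with this repo).
# Prints the name of the first pod whose name starts with "<controller>-".
# Prints nothing if no pod matches.
first_pod() {
  local controller="$1"
  kubectl get pods --no-headers -o custom-columns=NAME:.metadata.name \
    | grep "^${controller}-" \
    | head -n 1
}

# Examples (not run here):
#   kubectl port-forward "$(first_pod spark-ui-proxy-controller)" 8080:80
#   kubectl port-forward "$(first_pod zeppelin-controller)" 8081:8080
```

If a controller runs several replicas, this picks whichever pod kubectl lists first, which is fine for reaching a UI but not for anything that depends on a specific pod.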

Submit a Job

example from an sbt project with an assembly task

sbt assembly && kubectl exec -i spark-master-controller-<ID> -- /bin/bash -c 'cat > my.jar && /opt/spark/bin/spark-submit --deploy-mode client --master spark://spark-master:7077 --class my.Main ./my.jar' < target/scala-2.10/*.jar
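
One caveat with the one-liner above: it reads the jar through `< target/scala-2.10/*.jar`, and in bash a redirection whose glob matches more than one file is rejected as an ambiguous redirect. A minimal sketch of a guard follows. The function name single_jar is made up for illustration, and the directory is whatever your sbt build writes the assembly to.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with this repo).
# Prints the path of the only *.jar in a directory.
# Fails with a message on stderr if there are zero or several.
single_jar() {
  local dir="$1"
  # Expand the glob into the function's positional parameters.
  set -- "$dir"/*.jar
  if [ "$#" -ne 1 ] || [ ! -f "$1" ]; then
    echo "expected exactly one jar in $dir" >&2
    return 1
  fi
  printf '%s\n' "$1"
}

# Example (not run here): resolve the jar first, then feed it to the
# same kubectl exec command as above with  < "$jar"
#   jar="$(single_jar target/scala-2.10)" || exit 1
```

Resolving the path up front also gives a clearer error when the assembly step produced nothing.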

CHEAT

don't look or think, just do kubectl create -f .


Customize

Use the gen_new_cluster.sh script to create new standalone Spark clusters that can run safely within the same Kubernetes subnet. The script changes the name of the master used by the rest of the containers; it does not use Kubernetes namespaces.

PREFIX="my-fav-cluster" ./gen_new_cluster.sh

Then cd build, edit the files to adjust replica counts, ports, memory, etc., and deploy with kubectl create -f .