OpenShift Spark is a dockerized application, based on the CentOS 7 image, for deploying an Apache Spark 2.4.2 cluster to OpenShift.
Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs.
OpenShift is an open source container application platform by Red Hat, built on top of Docker containers and the Kubernetes container cluster manager.
Deployment time: 30 minutes
Clone this repo to your local machine.
$ git clone https://github.com/bodz1lla/openshift-spark.git
$ cd openshift-spark
Create a build config and start building the Spark image.
$ oc create -f openshift/build-spark-base.yaml
$ oc create imagestream spark
$ oc start-build spark-2.4.2
When the build has finished, check the logs and status.
$ oc logs -f bc/spark-2.4.2
$ oc get pod
NAME READY STATUS RESTARTS AGE
spark-2.4.2-1-build 0/1 Completed 0 6m
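The build should push the resulting image into the spark image stream created earlier; you can verify that the tag has landed (standard oc output, names as created above):
$ oc get imagestream spark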
Create a deployment config and start the master.
$ oc create -f openshift/deploy-spark-master.yaml
When the master has started, check the logs and confirm the pod state is "Running".
$ oc logs -f dc/spark-master
$ oc get pod
NAME READY STATUS RESTARTS AGE
spark-2.4.2-1-build 0/1 Completed 0 4m
spark-master-1-mxlhj 1/1 Running 0 55s
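The master log should report that it has been elected leader and is in the ALIVE state; a quick check (assuming the standard Spark standalone master log format):
$ oc logs dc/spark-master | grep -i "alive"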
Create the services and endpoints.
$ oc create -f openshift/service_spark_master.yaml
$ oc create -f openshift/service_spark_master_ui.yaml
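To verify, you can list both the services and their endpoints (standard oc syntax):
$ oc get svc,endpoints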
Expose the service and create a route so that external connections can reach Spark by DNS name.
$ oc expose svc/spark-master-ui --name=spark-master-ui --port=8080
Check the route and try to access Spark via a web browser or cURL.
$ oc get route spark-master-ui
$ curl -s http://${SPARK_MASTER_UI}
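${SPARK_MASTER_UI} above stands for the route hostname; one way to capture it into the variable (standard oc jsonpath output):
$ SPARK_MASTER_UI=$(oc get route spark-master-ui -o jsonpath='{.spec.host}')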
If you'd like to configure a secure HTTPS connection with a self-signed certificate using TLS edge termination, install "keytool" and generate a keystore; otherwise skip this step and move on to Spark Workers.
ATTENTION: Replace the ${SECRET_PASS} variable with your own password.
$ keytool -genkey -keyalg RSA -alias selfsigned -keystore keystore.jks -storepass ${SECRET_PASS} -validity 360 -keysize 2048
# Convert to pkcs12
$ keytool -importkeystore -srckeystore keystore.jks -destkeystore keystore.p12 -srcstoretype jks -deststoretype pkcs12
Once the key has been created, open it with OpenSSL.
$ openssl pkcs12 -in keystore.p12 -nodes -password pass:${SECRET_PASS}
Copy the certificate and private key that are displayed and save them for later.
Edit the route and insert the TLS configuration into the "spec:" section, after the "port:" key, as shown below:
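As an alternative to copying from the terminal, you can extract the certificate and the key straight into files (a sketch using standard OpenSSL pkcs12 flags; the file names tls.crt and tls.key are only examples):
$ openssl pkcs12 -in keystore.p12 -clcerts -nokeys -out tls.crt -password pass:${SECRET_PASS}
$ openssl pkcs12 -in keystore.p12 -nocerts -nodes -out tls.key -password pass:${SECRET_PASS}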
$ oc edit route spark-master-ui
---
spec:
  ...
  port:
    targetPort: 8080
  tls:
    certificate: |
      -----BEGIN CERTIFICATE-----
      ...
      -----END CERTIFICATE-----
    key: |
      -----BEGIN PRIVATE KEY-----
      ...
      -----END PRIVATE KEY-----
    termination: edge
    insecureEdgeTerminationPolicy: Redirect
Don't forget about YAML syntax and the 2-space indentation.
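If you'd rather not edit the route by hand, a one-shot alternative is to recreate it as an edge route (assuming you saved tls.crt and tls.key as in the sketch above; the existing route must be deleted first):
$ oc delete route spark-master-ui
$ oc create route edge spark-master-ui --service=spark-master-ui --cert=tls.crt --key=tls.key --insecure-policy=Redirect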
Check the route and try to access it via HTTPS.
$ oc get route spark-master-ui
$ curl -sk https://${SPARK_MASTER_UI}
Create a deployment config and start the workers.
By default the setup starts only 3 workers; you can change this in the deploy-spark-workers.yaml file by replacing the value of the "replicas:" key.
$ oc create -f openshift/deploy-spark-workers.yaml
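If the cluster is already running, you can also resize the worker pool without editing the file (standard oc scale syntax; 5 is just an example):
$ oc scale dc/spark-workers --replicas=5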
Check the logs and confirm the workers' state is "Running".
$ oc logs -f dc/spark-workers
$ oc get pods
NAME READY STATUS RESTARTS AGE
spark-2.4.2-1-build 0/1 Completed 0 37m
spark-master-1-7xqdq 1/1 Running 0 34m
spark-workers-1-7tj9d 1/1 Running 0 5m
spark-workers-1-8fbh2 1/1 Running 0 5m
spark-workers-1-kfdcm 1/1 Running 0 5m
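You can also confirm that the workers registered with the master by grepping its log (the "Registering worker" message is standard Spark standalone master output):
$ oc logs dc/spark-master | grep "Registering worker"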
If you see similar output with all pods in the "Running" state, you have successfully installed the Spark cluster :)
This section explains how to submit applications to the cluster remotely.
- Download a Spark release to your local machine.
- Check your firewall settings and allow TCP connections to the node port 30077. You don't need to change anything on the OpenShift server; this only applies to external firewalls such as AWS Security Groups or those of data-center providers like Hetzner. A quick way to double-check the node port is shown after this list.
- This project runs Spark version 2.4.2. It's important that the Spark versions running on the driver, the master, and the worker pods all match.
- Try to run the Python or Java example application on the cluster.
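Before submitting, you can confirm the node port the master service actually exposes (assuming the service created from service_spark_master.yaml is named spark-master):
$ oc get svc spark-master -o jsonpath='{.spec.ports[0].nodePort}'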
$ cd spark-2.4.2-bin-hadoop2.7
# Python
$ ./bin/spark-submit \
    --master spark://${OPENSHIFT_CLUSTER_IP}:30077 --name myapp \
    ${PWD}/examples/src/main/python/pi.py 10
# Java
$ ./bin/spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master spark://${OPENSHIFT_CLUSTER_IP}:30077 \
    --name myapp \
    --deploy-mode client \
    --supervise \
    --executor-memory 4G \
    --total-executor-cores 100 \
    local://${PWD}/examples/jars/spark-examples_2.12-2.4.2.jar 10
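${OPENSHIFT_CLUSTER_IP} above stands for an externally reachable address of one of your OpenShift nodes, since the master service is exposed as a NodePort; the standard wide output lists the node addresses:
$ oc get nodes -o wide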
If a connection to the cluster has been established and you can see the running application in the Spark UI, the test is complete.
Hope you enjoyed the setup and are ready to launch new applications!
- Fork it (https://github.com/bodz1lla/openshift-spark/fork)
- Create your feature branch (git checkout -b feature/foobar)
- Commit your changes (git commit -am 'Add some foobar')
- Push to the branch (git push origin feature/foobar)
- Create a new Pull Request
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
This project is licensed under the terms of the MIT license.
See COPYING for the full text.
- The Apache Software Foundation - Apache Spark
- Thomas Orozco - tini, an init for containers
- Veer Muchandi - video explanation - OpenShift: Using SSL