The simplest steps to run a Spark application using the Spark Operator are:
- Install Cloud Code (IDE plugin)
- Install WSL2 on Windows (Windows only)
- Install Docker for Desktop
- Install Helm
- Install the spark operator using Helm
- Run the setup.sh script in the terminal (preferred until issue 1401 is fixed):
./setup.sh
- Press the Play button on the Run configuration in PyCharm to set up and run the Cloud Code configuration.
If you do the above, then you don't need to read the following. Here's a bit more information.
Only needed for the Windows operating system.
On Windows, I generally use WSL2 rather than the Windows command prompt, because nearly all Kubernetes and Docker examples and documentation are written for Linux shells. Here are guides for the WSL2 install:
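If you prefer the command line, recent Windows builds can install WSL2 and a distribution directly (one possible route; the distribution name is an assumption, so follow the guides for details):
# From an elevated PowerShell or Command Prompt on Windows 10/11:
wsl --install -d Ubuntu
# After a reboot, confirm the WSL version:
wsl -l -v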
Once installed, run all the scripts in the WSL2 (Ubuntu) terminal. Note that your Kubernetes settings will be in ~/.kube/config within the WSL2 file system.
Install Helm; see the Helm install documentation. (On WSL2 or Linux, follow the Debian/Ubuntu section.) Running the included setup.sh script is simplest; it performs the steps described below.
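For reference, the Debian/Ubuntu route looks roughly like the following (a sketch; check the Helm install page for the current key and repository URLs):
curl https://baltocdn.com/helm/signing.asc | gpg --dearmor | sudo tee /usr/share/keyrings/helm.gpg > /dev/null
echo "deb [signed-by=/usr/share/keyrings/helm.gpg] https://baltocdn.com/helm/stable/debian/ all main" | sudo tee /etc/apt/sources.list.d/helm-stable-debian.list
sudo apt-get update && sudo apt-get install helm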
Update your local cache:
helm repo update
You can list your charts or those available with:
helm list -A
helm search repo
Ensure that you are using the correct context (local desktop or remote cluster)
kubectl config get-contexts
kubectl config use-context docker-desktop
Install the operator and its webhook. (If you set the sparkJobNamespace in the operator install, then the namespace must exist and it must match the one specified in the SparkApplication yaml file.) From the Spark Operator documentation, you could do the following:
helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
helm install my-release spark-operator/spark-operator --create-namespace --namespace spark-operator --set webhook.enable=true
But I recommend the following instead. That's because if you want to use a Jupyter notebook with spark-operator (or any other software that requires exposed ports), then you'll need a fixed version of the operator newer than 1.1.26. (See issue 1401.) You can install my fixed version (included herein), as follows:
helm install my-release spark-operator-1.1.26_fixed.tgz --create-namespace --namespace spark-operator --set webhook.enable=true
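As noted above, if you set sparkJobNamespace during the install, the SparkApplication must be created in that same namespace. A minimal sketch (the spark-jobs name is illustrative):
helm install my-release spark-operator-1.1.26_fixed.tgz --create-namespace --namespace spark-operator --set webhook.enable=true --set sparkJobNamespace=spark-jobs
and, in the SparkApplication yaml:
metadata:
  name: spark-hello
  namespace: spark-jobs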
If necessary, you can uninstall a helm chart by referring to its name:
helm uninstall my-release --namespace spark-operator
The name, my-release, is arbitrary; you can give it any name.
Ensure that a spark service account exists, per the RBAC guide:
kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
kubectl get serviceaccounts
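The SparkApplication then refers to this account in its driver spec, roughly as follows (a sketch):
spec:
  driver:
    serviceAccount: spark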
The following is unnecessary if you are using the Cloud Code (IDE plugin).
The Cloud Code plugin runs Docker and skaffold to perform these operations.
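Under the hood this is roughly equivalent to running skaffold yourself (the default-repo value depends on your registry and is an assumption):
skaffold dev --default-repo=localhost:32000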
Run and then view an application. Here the application is spark-hello; the name comes from inside the yaml file (resources/spark_hello.yaml).
kubectl apply -f resources/spark_hello.yaml
kubectl get sparkapplications spark-hello -o=yaml
kubectl describe sparkapplication spark-hello
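For reference, a spark-hello style SparkApplication looks roughly like this (a sketch; the actual resources/spark_hello.yaml in this repo may differ, and the image, version, and path are illustrative):
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-hello
  namespace: default
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: "MYID/myk8spark:latest"
  imagePullPolicy: Always
  mainApplicationFile: local:///opt/spark/work-dir/k8spark/runpi.py
  sparkVersion: "3.1.1"
  restartPolicy:
    type: Never
  driver:
    cores: 1
    memory: "512m"
    serviceAccount: spark
  executor:
    cores: 1
    instances: 1
    memory: "512m"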
Assuming that the application wrote data to the driver (rather than mounting a resource), the data can be copied to the local drive as follows:
#kubectl cp namespace/pod:remote_path local_path
kubectl cp default/gha-driver:/opt/spark/work-dir/result.json ./result.json
kubectl get sparkapplications --all-namespaces
kubectl get deployments --all-namespaces
kubectl delete sparkapplications spark-hello
- You may need to remove the spark application to redeploy it.
kubectl delete sparkapplications spark-hello
Development on Docker for Desktop with its local Docker registry is simpler than connecting to a remote cluster. Various configurations require a hostname, which is simply localhost when running on a local Docker for Desktop cluster. Below are guidelines for working with a remote (microk8s, GKE) cluster, which may be ignored if you are using a local cluster. Whether the cluster is local or remote, you will need to install the spark operator using Helm.
When using microk8s, the various configurations that require a hostname must reference the cluster manager hostname. If you are using the microk8s built-in registry, ensure that you can access it from outside the cluster. After enabling it, ensure that the certs.d directory includes a hosts.toml file that allows access via the cluster manager hostname.
# /var/snap/microk8s/current/args/certs.d/
server = "http://HOSTNAME:32000"
[host."http://HOSTNAME:32000"]
capabilities = ["pull", "resolve", "push"]
plain-http = true
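After editing the registry/containerd configuration, restarting microk8s is a reasonable way to make sure the change is picked up (an assumption; a containerd restart alone may be enough):
microk8s stop
microk8s start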
Do not use localhost with port forwarding; a bug blocks access using localhost.
In the Docker for Desktop settings (available from the Docker task icon), under Docker Engine, add:
"insecure-registries" : ["HOSTNAME:32000"]
Here, HOSTNAME is the remote hostname.
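A quick way to confirm that your workstation can reach the registry is to push a small test image (names are illustrative):
docker pull busybox
docker tag busybox HOSTNAME:32000/busybox:test
docker push HOSTNAME:32000/busybox:test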
Images in the yaml files (for the spark operator and skaffold) must refer to the hostname when using the built-in registry.
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
...
image: "HOSTNAME:32000/k8sparkgha"
# Always update the image
imagePullPolicy: Always
apiVersion: skaffold/v2beta25
kind: Config
metadata:
name: gha
build:
tagPolicy:
# Ensure that skaffold marks the image as latest
sha256: {}
artifacts:
- image: HOSTNAME:32000/k8sparkgha
...
The image repository, for Cloud Code (IDE plugin), should be the cluster manager hostname and registry port (when using the built-in registry).
HOSTNAME:32000
Removing images from the microk8s built-in registry is not easy. See the prune.sh script, copied from the microk8s cheat sheet.
Applications on the cluster are exposed through ports, which may be only accessible from within the cluster. When using a remote cluster, use kubectl port-forward to access the cluster ports from your local computer. Here are some common ports:
Spark UI (at 4040)
kubectl port-forward --namespace=default --address localhost service/gha-ui-svc 4040:4040
Kubernetes Dashboard (at 10443)
kubectl port-forward -n kube-system service/kubernetes-dashboard 10443:443
Kubernetes Docker private registry (at 32000); typically unnecessary. Instead, use the cluster manager hostname in image references.
kubectl port-forward --namespace=container-registry --address localhost service/registry 32000:5000
To access the Kubernetes Dashboard, you'll need to obtain a token. See the dashboard instructions.
token=$(microk8s kubectl -n kube-system get secret | grep default-token | cut -d " " -f1)
microk8s kubectl -n kube-system describe secret $token
You can build your own Docker image to run custom spark applications. Review this explanation of using a Spark Docker image on Kubernetes, and review the images on datamechanics/spark. After editing your Dockerfile, build your image. Below, it is tagged in preparation for uploading to GCP. Note that for GCP you need to use your own project name instead of mine (gcr.io/project1), and your own account (login) ID instead of MYID. In any case, this works on a local Docker for Desktop. (See docker_build.sh.)
docker build -t MYID/myk8spark:1.0 -t MYID/myk8spark:latest \
-t gcr.io/project1/myk8spark:1.0 -t gcr.io/project1/myk8spark:latest .
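For reference, the Dockerfile being built is typically along these lines (a sketch; the base image tag, working directory, and paths are assumptions, and the repo's own Dockerfile may differ):
# Choose a base image tag from the datamechanics/spark page (or another Spark image)
FROM datamechanics/spark:TAG
WORKDIR /opt/spark/work-dir
# Copy the application code; this is the location mainApplicationFile points at
COPY k8spark/ k8spark/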
After building the image, open a shell in it to check that it's correct. (Your image name may be different.)
docker run -it myk8spark /bin/bash
Try your program, which will be in the location copied by the Dockerfile and referenced in the spark operator yaml file (mainApplicationFile).
$SPARK_HOME/bin/spark-submit k8spark/runpi.py
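That location must match the mainApplicationFile in the SparkApplication yaml, for example (the path is an assumption based on the Dockerfile sketch above):
mainApplicationFile: local:///opt/spark/work-dir/k8spark/runpi.py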
If your application uses custom ports, you'll need to expose them. See this explanation of accessing the Spark UI on Kubernetes.
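While developing, a quick alternative is to port-forward directly to the driver pod (the pod name and port are illustrative, e.g. 8888 for a Jupyter notebook):
kubectl port-forward pod/gha-driver 8888:8888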
The Cloud Code (IDE plugin) can build and deploy the spark application.
- Ensure that the Deployment Configuration points to the correct image repository. It is likely to be localhost:32000 (local cluster), HOSTNAME:32000 (remote cluster), or a public repository (like DockerHub)
- Ensure that the Kubeconfig file setting is correct in the advanced Deployment Configuration
- The skaffold file should (generally) use the sha256 tag so that new images are deployed.