UPDATED for Spark 3.1.3
We will install the Spark Operator, test Spark on Kubernetes, and build and launch our own custom images in VK Cloud
https://spark.apache.org/docs/latest/running-on-kubernetes.html
https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/quick-start-guide.md
https://github.com/GoogleCloudPlatform/spark-on-k8s-operator/blob/master/docs/user-guide.md
https://youtu.be/hQI-QYJXlVU?t=12317
It is easier to create a host VM in the cloud.
All further work in the cloud can be done from that VM.
You can install kubectl, Helm, Docker and everything else on this VM and not clutter your own local machine.
How to create VM: https://mcs.mail.ru/help/ru_RU/create-vm/vm-quick-create
How to connect: https://mcs.mail.ru/help/ru_RU/vm-connect/vm-connect-nix
Steps:
- Create VM
- Connect to VM with SSH
- Perform all steps described further in this instruction from this VM
Dataset: https://disk.yandex.ru/d/gn19jm6mVBnwzQ
Instruction: https://mcs.mail.ru/help/kubernetes/clusterfast
You may run into problems with Gatekeeper, so please delete it: https://mcs.mail.ru/docs/base/k8s/k8s-addons/k8s-gatekeeper/k8s-opa#udalenie
Kubernetes as a Service: https://mcs.mail.ru/app/services/containers/add/
https://mcs.mail.ru/help/ru_RU/k8s-start/connect-k8s
export KUBECONFIG=/replace_with_path/to_your_kubeconfig.yaml
alias k=kubectl
source <(kubectl completion bash)
complete -F __start_kubectl k
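A quick sanity check that kubectl can reach the new cluster (assuming KUBECONFIG is set as above):
kubectl cluster-info
kubectl get nodes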
https://helm.sh/docs/intro/install/
https://docs.docker.com/engine/install/ubuntu/
https://docs.docker.com/engine/reference/commandline/login/
https://ropenscilabs.github.io/r-docker-tutorial/04-Dockerhub.html
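After installing Docker, log in to the registry you will push images to, so that the push step later in this guide works. A minimal sketch, assuming a Docker Hub account:
#you will be prompted for your Docker Hub username and password
sudo docker login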
http://spark.apache.org/downloads.html
Please note: for this tutorial I am using spark-3.1.3-bin-hadoop3.2.
Download and extract the tarball:
wget https://archive.apache.org/dist/spark/spark-3.1.3/spark-3.1.3-bin-hadoop3.2.tgz
tar -xvzf spark-3.1.3-bin-hadoop3.2.tgz
#add the following two lines to ~/.profile
nano ~/.profile
export SPARK_HOME=~/spark-3.1.3-bin-hadoop3.2
alias spark-shell="$SPARK_HOME/bin/spark-shell"
#apply the changes to the current shell
source ~/.profile
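Check that the variable is picked up and Spark runs locally:
echo $SPARK_HOME
$SPARK_HOME/bin/spark-submit --version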
helm repo add spark-operator https://googlecloudplatform.github.io/spark-on-k8s-operator
#Here we install a specific version (1.1.25) because the latest versions sometimes have bugs.
helm install my-release spark-operator/spark-operator --namespace spark-operator --create-namespace --set webhook.enable=true --version 1.1.25
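To check that the operator started (the exact pod name is generated from the release name):
helm list -n spark-operator
kubectl get pods -n spark-operator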
#create service account, role and rolebinding for spark
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: default
  name: spark-role
rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["*"]
- apiGroups: [""]
  resources: ["services"]
  verbs: ["*"]
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["*"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: spark-role-binding
  namespace: default
subjects:
- kind: ServiceAccount
  name: spark
  namespace: default
roleRef:
  kind: Role
  name: spark-role
  apiGroup: rbac.authorization.k8s.io
EOF
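A quick check that the service account, role and rolebinding were created:
kubectl get serviceaccount spark -n default
kubectl get role spark-role -n default
kubectl get rolebinding spark-role-binding -n default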
git clone https://github.com/stockblog/webinar_spark_k8s/ webinar_spark_k8s
kubectl apply -f webinar_spark_k8s/yamls_configs/spark-pi.yaml
kubectl get sparkapplications.sparkoperator.k8s.io
kubectl describe sparkapplications.sparkoperator.k8s.io spark-pi
kubectl get pods
#the driver log should contain a line like "Pi is roughly 3.14..."
kubectl logs spark-pi-driver | grep 3.1
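You can also watch the pods and query the application state reported by the operator; it should go from SUBMITTED through RUNNING to COMPLETED (the jsonpath below assumes the usual SparkApplication status fields; kubectl describe above shows the same information):
kubectl get pods -w
kubectl get sparkapplications.sparkoperator.k8s.io spark-pi -o jsonpath='{.status.applicationState.state}'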
We will use docker-image-tool.sh, and the Docker build context for this tool is $SPARK_HOME. So we need to copy the Spark .py job files into $SPARK_HOME so that they are accessible when building the Docker image.
You can read more about docker-image-tool.sh: https://spark.apache.org/docs/latest/running-on-kubernetes.html#docker-images
More information about the Docker build context: https://docs.docker.com/engine/reference/builder/
Additional info about custom Docker images for Spark: https://www.waitingforcode.com/apache-spark/docker-images-apache-spark-applications/read
cd
#skip this clone if you already cloned the repo in the previous step
git clone https://github.com/stockblog/webinar_spark_k8s/ webinar_spark_k8s
mv webinar_spark_k8s/custom_jobs/ $SPARK_HOME
#set your Docker repo (e.g. your Docker Hub username)
export YOUR_DOCKER_REPO=
#example: export YOUR_DOCKER_REPO=mcscloud
sudo $SPARK_HOME/bin/docker-image-tool.sh -r $YOUR_DOCKER_REPO -t spark_k8s_intel -p ~/webinar_spark_k8s/yamls_configs/Dockerfile build
sudo $SPARK_HOME/bin/docker-image-tool.sh -r $YOUR_DOCKER_REPO -t spark_k8s_intel -p ~/webinar_spark_k8s/yamls_configs/Dockerfile push
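To confirm the images were built and pushed (docker-image-tool.sh names them <repo>/spark and <repo>/spark-py with the given tag):
sudo docker images | grep spark_k8s_intel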
You can obtain credentials for S3 access and create buckets with the help of these guides:
https://mcs.mail.ru/help/ru_RU/s3-start/s3-account
https://mcs.mail.ru/help/ru_RU/s3-start/create-bucket
Attention: use a simple bucket name without underscores or other special symbols, because special symbols in bucket names sometimes cause strange glitches.
kubectl create secret generic s3-secret --from-literal=S3_ACCESS_KEY='PLACE_YOUR_S3_CRED_HERE' --from-literal=S3_SECRET_KEY='PLACE_YOUR_S3_CRED_HERE'
#REPLACE S3_PATH AND S3_WRITE_PATH WITH YOUR PARAMETERS
kubectl create configmap s3path-config --from-literal=S3_PATH='s3a://s3-demo/evo_train_new.csv' --from-literal=S3_WRITE_PATH='s3a://s3-demo/write/evo_train_csv/'
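To double-check what was created (describe shows only the key names and sizes, not the secret values; the keys should be S3_ACCESS_KEY and S3_SECRET_KEY):
kubectl describe secret s3-secret
kubectl describe configmap s3path-config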
#choose one of the example yamls in the yamls_configs directory, edit it, and add your parameters such as Docker repo, image, and paths to files in S3
kubectl apply -f ~/webinar_spark_k8s/yamls_configs/s3read_write_with_secret_cfgmap.yaml
kubectl get sparkapplications.sparkoperator.k8s.io
kubectl describe sparkapplications.sparkoperator.k8s.io s3read-write-test
kubectl get pods
#replace pod_name with the name of your driver or executor pod
kubectl logs pod_name
kubectl get events
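If you are not sure of the pod name, the driver pod is usually called <application-name>-driver and Spark labels it with spark-role=driver, so you can find and follow it like this (a sketch; adjust the name if yours differs):
kubectl get pods -l spark-role=driver
kubectl logs -f s3read-write-test-driver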
https://github.com/helm/charts/tree/master/stable/spark-history-server
#create namespace for History Server
kubectl create ns spark-history-server
#create a secret so the History Server can access S3
kubectl create secret generic s3-secret --from-literal=S3_ACCESS_KEY='PLACE_YOUR_S3_CRED_HERE' --from-literal=S3_SECRET_KEY='PLACE_YOUR_S3_CRED_HERE' -n spark-history-server
#create a yaml file with the config for the History Server
#you should create an S3 bucket named spark-hs with a directory inside it named spark-hs, or change the names in the s3.logDirectory parameter
helm repo add stable https://charts.helm.sh/stable
#Edit values-hs.yaml: you should specify your own logDirectory parameter.
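For orientation, the S3 part of values-hs.yaml roughly follows the structure below. The key names are taken from the stable/spark-history-server chart and the secret created earlier; treat this as a sketch and rely on the values-hs.yaml shipped in the repo for the exact contents.
s3:
  enableS3: true
  enableIAM: false
  secret: s3-secret
  accessKeyName: S3_ACCESS_KEY
  secretKeyName: S3_SECRET_KEY
  logDirectory: s3a://spark-hs/spark-hs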
helm install -f ~/webinar_spark_k8s/yamls_configs/values-hs.yaml my-spark-history-server stable/spark-history-server --namespace spark-history-server
kubectl get service -n spark-history-server
#Edit s3_hs_server_test.yaml before launching and fill in your own parameters.
kubectl apply -f ~/webinar_spark_k8s/yamls_configs/s3_hs_server_test.yaml
Go to the external IP of the History Server and check the logs of your Spark app.
WARNING: In production you should not expose your Spark History Server to the whole internet. Use service type ClusterIP together with VPNaaS or another solution.
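If you just want a quick look without exposing anything, you can also port-forward to the History Server service (assuming the service name matches the release installed above and the default History Server port 18080; check the kubectl get service output):
kubectl port-forward -n spark-history-server svc/my-spark-history-server 18080:18080
#then open http://localhost:18080 (through an SSH tunnel if you are working on the host VM)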