[QUESTION] spark-submit not called, and other questions
RyanZotti opened this issue · 6 comments
The main reason for this issue is that it seems my PySpark job is not running, and the reason I think it's not running is that I don't see:

- any of the typical log boilerplate of Spark applications (lots of `INFO`, `WARN`, etc.)
- a `spark-submit` in the logs, which I had previously seen before I made a bunch of changes trying to fix permissions and image pull issues (the check I have in mind is sketched right below)
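For that second point, the check I mean is grepping the operator pod's own logs for the submission it issues on the application's behalf. This is only a sketch; the deployment name is my assumption based on the Helm release name used further down, so adjust it if yours differs.

```bash
# Look for the spark-submit that the operator itself runs when it picks up a SparkApplication
# (deployment name assumed from the "my-release" Helm release; adjust to your install)
kubectl -n spark-operator logs deploy/my-release-spark-operator | grep -i submit
```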
Below are the steps to reproduce my problem.
```bash
# Start with a fresh cluster
minikube start

# Add the repo
helm repo add spark-operator https://kubeflow.github.io/spark-operator

# Install the chart
helm install my-release spark-operator/spark-operator \
  -f values.yaml \
  --namespace spark-operator \
  --create-namespace \
  --version 1.2.14

# Apply a variety of hacks I had to apply to get around permissions issues
kubectl apply -f trial-and-error-fixes.yaml

# Submit a job
kubectl apply -f spark-py-pi.yaml

# Get the driver's logs
kubectl logs pyspark-pi-driver -n spark-jobs
```
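In case it helps with triage, a couple of commands that should show whether the operator accepted the submission at all (application name and namespace are the ones from the manifest below):

```bash
# Application-level status and events recorded by the operator
kubectl -n spark-jobs get sparkapplication pyspark-pi
kubectl -n spark-jobs describe sparkapplication pyspark-pi

# Recent events in the jobs namespace, oldest first
kubectl -n spark-jobs get events --sort-by=.lastTimestamp
```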
The logs of the driver, from the `kubectl logs` command above, show this:
```
++ id -u
+ myuid=0
++ id -g
+ mygid=0
+ set +e
++ getent passwd 0
+ uidentry=root:x:0:0:root:/root:/bin/bash
+ set -e
+ echo 0
+ echo 0
+ echo root:x:0:0:root:/root:/bin/bash
0
0
root:x:0:0:root:/root:/bin/bash
+ [[ -z root:x:0:0:root:/root:/bin/bash ]]
+ exec /usr/bin/tini -s -- /usr/bin/spark-operator driver --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner local:///opt/spark/examples/src/main/python/pi.py
```
Where is the `spark-submit` command? I'm expecting the last two lines to be something like:

```
+ CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@")
+ exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=172.16.24.210 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class ...
```
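One way to see what the driver container will actually exec is to compare the entrypoints baked into the image I set as `spec.image` and a stock Spark image. This only reads image metadata and assumes both images have been pulled locally:

```bash
# Print the configured ENTRYPOINT/CMD of each image
docker inspect --format '{{.Config.Entrypoint}} {{.Config.Cmd}}' kubeflow/spark-operator:v1beta2-1.4.5-3.5.0
docker inspect --format '{{.Config.Entrypoint}} {{.Config.Cmd}}' spark:3.5.0
```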
Contents of `spark-py-pi.yaml`. I lightly adapted this from the file of the same name provided in the repo's examples folder.
apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
name: pyspark-pi
namespace: spark-jobs
spec:
type: Python
pythonVersion: "3"
mode: cluster
image: "kubeflow/spark-operator:v1beta2-1.4.5-3.5.0"
mainApplicationFile: local:///opt/spark/examples/src/main/python/pi.py
sparkVersion: "3.5.0"
restartPolicy:
type: OnFailure
onFailureRetries: 3
onFailureRetryInterval: 10
onSubmissionFailureRetries: 5
onSubmissionFailureRetryInterval: 20
driver:
cores: 1
coreLimit: "1200m"
memory: "512m"
labels:
version: 3.5.0
serviceAccount: spark
executor:
cores: 1
instances: 1
memory: "512m"
labels:
version: 3.5.0
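A server-side dry run can be used to have the API server (and the webhook, if it's enabled) validate this manifest against the CRD schema before actually submitting anything:

```bash
# Validates the SparkApplication without creating it
kubectl apply --dry-run=server -f spark-py-pi.yaml
```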
`values.yaml`:
```yaml
sparkJobNamespaces: ["spark-jobs"]
webhook:
  enable: true
image:
  repository: docker.io/kubeflow/spark-operator
  tag: v1beta2-1.4.5-3.5.0
spark:
  serviceAccountName: spark
rbac:
  createClusterRole: true
logLevel: 3
```
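For comparison, the chart defaults for this chart version can be listed, and the overrides that actually took effect can be read back after installing:

```bash
# Default values shipped with chart version 1.2.14
helm show values spark-operator/spark-operator --version 1.2.14

# Overrides currently applied to the release
helm get values my-release -n spark-operator
```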
`trial-and-error-fixes.yaml`:
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: spark-operator
---
apiVersion: v1
kind: Namespace
metadata:
  name: spark-jobs
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: spark-jobs
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: missing-kubeflow-permissions
rules:
  - apiGroups:
      - "sparkoperator.k8s.io"
    resources:
      - "sparkapplications"
      - "scheduledsparkapplications"
      - "sparkapplications/status"
    verbs:
      - "list"
      - "watch"
      - "update"
      - "get"
  - apiGroups:
      - ""
    resources:
      - "pods"
      - "events"
    verbs:
      - "list"
      - "create"
      - "patch"
      - "watch"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: missing-kubeflow-permissions-role-binding
  namespace: spark-jobs
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: missing-kubeflow-permissions
subjects:
  - kind: ServiceAccount
    name: spark
    namespace: spark-jobs
```
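A quick sanity check of what the `spark` service account is actually allowed to do, using `kubectl auth can-i` with impersonation (the resources checked here are just examples):

```bash
# Can the driver's service account create executor pods?
kubectl auth can-i create pods -n spark-jobs --as=system:serviceaccount:spark-jobs:spark

# Can it read SparkApplication objects?
kubectl auth can-i list sparkapplications.sparkoperator.k8s.io -n spark-jobs \
  --as=system:serviceaccount:spark-jobs:spark
```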
And while I'm here, I had several other totally unrelated questions I wanted to ask but couldn't find answers to on Google, Stack Overflow, or in the docs:

- I'm new to Kubernetes. Does the Spark Operator support two separate images, one for the operator itself and the other for my Spark application?
- What is the relationship between a chart version and an image tag? They're 1-1, correct?
- What do the `logLevel` numbers map to? Is there a way to circumvent the magic numbers and explicitly specify the level, e.g., `INFO` or `DEBUG`? (A quick way to check what the number turns into is sketched below.)
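For that last question, one way to see what the chart turned `logLevel: 3` into is to read the operator container's args back out of its deployment. The deployment name is again my assumption based on the release name:

```bash
# Show the arguments the chart passed to the operator binary
kubectl -n spark-operator get deploy my-release-spark-operator \
  -o jsonpath='{.spec.template.spec.containers[0].args}'
```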
I was able to run your examples on my cluster (not minikube) using `image: "spark:3.5.0"` for the `SparkApplication`.
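If it's useful, you can confirm which image the driver pod actually ran with (pod name taken from your logs command above):

```bash
# Print the image of the driver pod's first container
kubectl -n spark-jobs get pod pyspark-pi-driver -o jsonpath='{.spec.containers[0].image}'
```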
@mereck Yes, that worked for me too. Thanks.
I'm new to Kubernetes and operators in general. Am I correct to assume that the driver and executors can use an image separate from the image of the operator? I assume that's the case, since that would explain why there are two places to specify images, one in the spark application yaml and another in the helm install command, but that wasn't clear to me from the docs.
@RyanZotti glad it worked!
Yes, it seems we can provide two images here: one for the pod that runs the operator itself, which has its own service account and permissions, and another for the SparkApplication, which is used to spawn the driver and executors under a different service account with possibly different permissions. During `helm install` we specify the operator pod's image, but when we write the SparkApplication manifest we can supply a different image. This is quite handy, since we often need to extend the driver/executor images with additional layers for various connectors, as described in the GCP guide for example.
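For instance, both images can be read back from the cluster once things are running (the deployment and application names below are the ones used earlier in this thread):

```bash
# The operator's own image, set at helm install time
kubectl -n spark-operator get deploy my-release-spark-operator \
  -o jsonpath='{.spec.template.spec.containers[0].image}'

# The application image, set via spec.image and used for the driver and executor pods
kubectl -n spark-jobs get sparkapplication pyspark-pi -o jsonpath='{.spec.image}'
```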
@mereck That makes sense. As a quick follow-up, inspired by the external jars example you linked to, do you know if I'm doing anything wrong while adding third-party jars?
For example, assume a driver/executor Dockerfile like so:
```dockerfile
FROM spark:3.5.0
ADD https://repo1.maven.org/maven2/com/mysql/mysql-connector-j/8.3.0/mysql-connector-j-8.3.0.jar /opt/spark/work-dir/
```
Where I've added the following lines under `spec` in my original `spark-py-pi.yaml`:

```yaml
deps:
  jars:
    - local:///opt/spark/work-dir/mysql-connector-j-8.3.0.jar
```
Built the image like so:

```bash
eval $(minikube docker-env)
docker build -t spark-debug:latest -f Dockerfile .
```
and where I've updated `spark-py-pi.yaml` to point to the `spark-debug:latest` image accordingly.
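As a side note, a quick way to confirm the image actually landed in minikube's Docker daemon (the `spark-debug` tag is the one built above):

```bash
# Run against minikube's Docker daemon rather than the host's
eval $(minikube docker-env)
docker images spark-debug
```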
When I do all that I get a message like this:
```
Files local:///opt/spark/work-dir/mysql-connector-j-8.3.0.jar from /opt/spark/work-dir/mysql-connector-j-8.3.0.jar to /opt/spark/work-dir/mysql-connector-j-8.3.0.jar
Exception in thread "main" java.nio.file.NoSuchFileException: /opt/spark/work-dir/mysql-connector-j-8.3.0.jar
```
I can see the file at the specified location when I log into an interactive container. Do the jars need to be "local" to the Spark Operator image? I don't think that makes sense, but want to confirm.
It sounds like that jar would only need to be on the SparkApplication image. Have you tried adding a `chmod 644` command to the Dockerfile as well? Also, I think it's better to place jars in `$SPARK_HOME/jars`. Here's an example.
If permissions don't help, I would try spinning up a container with bash and exploring the file system to check whether the file you added is there:

```bash
docker run --rm -it --entrypoint bash <image-name-or-id>
```
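On the permissions angle specifically: Docker's `ADD` of a remote URL stores the file with mode 600 and owned by root, while the Spark images run as a non-root user, so something like this is worth checking (image tag taken from your messages above):

```bash
# Show numeric owner, group, and mode of the jar inside the built image
docker run --rm --entrypoint ls spark-debug:latest -ln /opt/spark/work-dir/
```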