kubeflow/spark-operator

[QUESTION] spark-submit not called, and other questions

RyanZotti opened this issue · 6 comments

The main reason for this issue is that it seems my PySpark job is not running, and the reason that I think it's not running is that I don't see:

  • any of the typical log boilerplate of Spark applications (lots of INFO, WARN, etc.)
  • a spark-submit in the logs, which I had previously seen before I made a bunch of changes trying to fix permissions and image pull issues

Below are the steps to reproduce my problem.

# Start with a fresh cluster
minikube start

# Add the repo
helm repo add spark-operator https://kubeflow.github.io/spark-operator

# Install the chart
helm install my-release spark-operator/spark-operator  \
    -f values.yaml \
    --namespace spark-operator \
    --create-namespace \
    --version 1.2.14

# Apply a variety of hacks I had to apply to get around permissions issues
kubectl apply -f trial-and-error-fixes.yaml

# Submit a job
kubectl apply -f spark-py-pi.yaml

# Get the driver's logs
kubectl logs pyspark-pi-driver -n spark-jobs
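For context on where to look: the spark-submit itself is performed by the operator pod, not the driver pod, so the operator's logs and the SparkApplication status are the places to check for a failed or skipped submission. A sketch, assuming the operator deployment is named my-release-spark-operator (adjust to whatever kubectl get deploy -n spark-operator shows):

```shell
# The operator performs spark-submit; its logs show the command (or the error).
kubectl logs -n spark-operator deploy/my-release-spark-operator

# The SparkApplication status records submission attempts and failures.
kubectl describe sparkapplication pyspark-pi -n spark-jobs
```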

The logs of the driver, using the last command, show this:

++ id -u
+ myuid=0
++ id -g
+ mygid=0
+ set +e
++ getent passwd 0
+ uidentry=root:x:0:0:root:/root:/bin/bash
+ set -e
+ echo 0
+ echo 0
+ echo root:x:0:0:root:/root:/bin/bash
0
0
root:x:0:0:root:/root:/bin/bash
+ [[ -z root:x:0:0:root:/root:/bin/bash ]]
+ exec /usr/bin/tini -s -- /usr/bin/spark-operator driver --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner local:///opt/spark/examples/src/main/python/pi.py

Where is the spark-submit command? I'm expecting the last two lines to be something like:

+ CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$@")
+ exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=172.16.24.210 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class ...

Contents of spark-py-pi.yaml. I lightly adapted this from the file of the same name provided in the repo's example folder.

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: pyspark-pi
  namespace: spark-jobs
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: "kubeflow/spark-operator:v1beta2-1.4.5-3.5.0"
  mainApplicationFile: local:///opt/spark/examples/src/main/python/pi.py
  sparkVersion: "3.5.0"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: 3.5.0
    serviceAccount: spark
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 3.5.0

values.yaml:

sparkJobNamespaces: ["spark-jobs"]
webhook:
  enable: true
image:
  repository: docker.io/kubeflow/spark-operator
  tag: v1beta2-1.4.5-3.5.0
spark:
  serviceAccountName: spark
rbac:
  createClusterRole: true
logLevel: 3

trial-and-error-fixes.yaml

apiVersion: v1
kind: Namespace
metadata:
  name: spark-operator
---
apiVersion: v1
kind: Namespace
metadata:
  name: spark-jobs
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: spark-jobs
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: missing-kubeflow-permissions
rules:
  - apiGroups:
    - "sparkoperator.k8s.io"
    resources:
      - "sparkapplications"
      - "scheduledsparkapplications"
      - "sparkapplications/status"
    verbs:
      - "list"
      - "watch"
      - "update"
      - "get"
  - apiGroups:
      - ""
    resources:
      - "pods"
      - "events"
    verbs:
      - "list"
      - "create"
      - "patch"
      - "watch"
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: missing-kubeflow-permissions-role-binding
  namespace: spark-jobs
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: missing-kubeflow-permissions
subjects:
  - kind: ServiceAccount
    name: spark
    namespace: spark-jobs

And while I'm here, I had several other totally unrelated questions I wanted to ask but couldn't find answers to on Google, Stack Overflow or in the docs:

  • I'm new to Kubernetes. Does the Spark Operator support two separate images, one for the operator itself and the other for my Spark application?
  • What is the relationship between a chart version and an image tag? They're 1:1, correct?
  • What do the logLevel numbers map to? Is there a way to circumvent the magic numbers and explicitly specify the level, e.g., INFO or DEBUG?

I was able to run your examples on my cluster (not minikube) using image: "spark:3.5.0" for the SparkApplication.

@mereck Yes, that worked for me too. Thanks.
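For anyone landing here with the same symptom: the driver log above execs /usr/bin/spark-operator driver ... because the SparkApplication pointed at the operator image rather than a Spark distribution image. Based on the fix reported above, the relevant change in spark-py-pi.yaml is just the image line (fragment of the spec, not a complete manifest):

```yaml
spec:
  # Driver/executor pods need a Spark distribution image,
  # not the operator's own image:
  image: "spark:3.5.0"
```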

I'm new to Kubernetes and operators in general. Am I correct to assume that the driver and executors can use an image separate from the operator's image? I assume so, since that would explain why there are two places to specify images (one in the SparkApplication YAML, another in the helm install command), but that wasn't clear to me from the docs.

@RyanZotti glad it worked!

Yes, we can provide two images here: one for the pod that runs the operator itself, which has its own service account and permissions, and another for the SparkApplication, which is responsible for spawning the driver and executors and uses a different service account with possibly different permissions. During helm install we specify the operator pod's image, but in the SparkApplication manifest we can supply a different one. This is quite handy, as we often need to extend the driver/executor images with additional layers for various connectors, as described in the GCP guide for example.
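To make the two-image split concrete, here are the two places side by side, using the values from this thread (a sketch; the helm values set the operator pod's image, while the SparkApplication manifest sets the driver/executor image):

```yaml
# values.yaml (helm install): image for the operator pod itself
image:
  repository: docker.io/kubeflow/spark-operator
  tag: v1beta2-1.4.5-3.5.0
---
# spark-py-pi.yaml (SparkApplication): image for driver and executor pods
spec:
  image: "spark:3.5.0"
```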

@mereck That makes sense. As a quick follow-up, inspired by the external jars example you linked to: do you know if I'm doing anything wrong when adding third-party jars?

For example, assume a driver/executor Dockerfile like so:

FROM spark:3.5.0

ADD https://repo1.maven.org/maven2/com/mysql/mysql-connector-j/8.3.0/mysql-connector-j-8.3.0.jar /opt/spark/work-dir/

Where I've added the following line under spec to my original spark-py-pi.yaml:

  deps:
    jars:
      - local:///opt/spark/work-dir/mysql-connector-j-8.3.0.jar

Built the image like so:

eval $(minikube docker-env)
docker build -t spark-debug:latest -f Dockerfile .

and where I've updated spark-py-pi.yaml to point to the spark-debug:latest image accordingly.

When I do all that I get a message like this:

Files local:///opt/spark/work-dir/mysql-connector-j-8.3.0.jar from /opt/spark/work-dir/mysql-connector-j-8.3.0.jar to /opt/spark/work-dir/mysql-connector-j-8.3.0.jar
Exception in thread "main" java.nio.file.NoSuchFileException: /opt/spark/work-dir/mysql-connector-j-8.3.0.jar

I can see the file at the specified location when I log into an interactive container. Do the jars need to be "local" to the Spark Operator image? I don't think that makes sense, but want to confirm.

It sounds like that jar would only need to be on the SparkApplication image. Have you tried adding a chmod 644 for the jar in the Dockerfile as well? Also, I think it's better to place jars in $SPARK_HOME/jars.

Here's an example.
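A sketch of what that advice might look like as a Dockerfile. Note that ADD from a remote URL creates the file root-owned with 600 permissions, hence the chmod; the USER spark line is an assumption about the base image's default runtime user:

```dockerfile
FROM spark:3.5.0

# ADD from a URL creates the jar root-owned with 600 permissions,
# so switch to root, fix the permissions, then drop privileges again.
USER root
ADD https://repo1.maven.org/maven2/com/mysql/mysql-connector-j/8.3.0/mysql-connector-j-8.3.0.jar /opt/spark/jars/
RUN chmod 644 /opt/spark/jars/mysql-connector-j-8.3.0.jar
USER spark
```

With the jar under $SPARK_HOME/jars it should be on the driver and executor classpath by default, so the deps.jars entry in the SparkApplication may no longer be needed.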

If permissions don't help, I would try spinning up a container with bash and exploring the file system to check whether the file you've added is there:

docker run --rm -it --entrypoint bash <image-name-or-id>

@mereck Thanks! That did the trick, so I'm marking this issue as closed.