insitro/redun

Job array causes error message on older versions of k8s


The k8s executor depends on Indexed Jobs, a k8s feature that became stable in v1.24: https://kubernetes.io/docs/tasks/job/indexed-parallel-processing-static/

When I run a job with the default settings (where max_array_size > 1) on my EKS cluster, which is running v1.21, I see these errors (or possibly warnings):

[redun] Executor[k8s]: Pod redun-job-d64219e107664faab6f1223c52909c0a-array-888pz is missing job-completion-index: {'kubernetes.io/psp': 'rafay-privileged-psp'}

The k8s jobs all end up in the Error state, and the workflow never finishes because it keeps hitting that error.

We already have code that should detect versions older than v1.21:
https://github.com/insitro/redun/blob/main/redun/executors/k8s.py#L418
but I think these code paths still execute:
https://github.com/insitro/redun/blob/main/redun/executors/k8s.py#L478
and
https://github.com/insitro/redun/blob/main/redun/executors/k8s.py#L771
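
For illustration, a version gate that those code paths could share might look roughly like the sketch below. This is not redun's actual code: `parse_k8s_version` and `supports_job_arrays` are hypothetical names, and the minimum version is left as a parameter because the right cutoff is part of what this issue is asking about.

```python
import re
from typing import Tuple


def parse_k8s_version(git_version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as 'v1.21.14-eks-fb459a0'."""
    match = re.match(r"v?(\d+)\.(\d+)", git_version)
    if match is None:
        raise ValueError(f"Unrecognized Kubernetes version: {git_version!r}")
    return int(match.group(1)), int(match.group(2))


def supports_job_arrays(git_version: str, minimum: Tuple[int, int]) -> bool:
    """Return True if the cluster version is at least `minimum` (hypothetical cutoff)."""
    return parse_k8s_version(git_version) >= minimum
```

If every array-related path consulted one check like this, a cluster below the cutoff would never reach the code that expects the completion index.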

To repro, I think you could use minikube to start a cluster running k8s v1.23 or earlier and then run redun against it.
To fix, I think you could remove the warning at
https://github.com/insitro/redun/blob/main/redun/executors/k8s.py#L771
and properly handle tasks that are missing that field.
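
As a rough sketch of what "properly handle" could mean, the lookup could return `None` instead of only logging when the annotation is absent, so the caller can decide what to do (for example, fail the job explicitly or fall back to non-array submission). The helper name `get_completion_index` is hypothetical and this is not redun's code. The annotation key is the one Kubernetes sets on Indexed Job pods.

```python
from typing import Mapping, Optional

# Annotation that Kubernetes sets on pods created by an Indexed Job.
INDEX_ANNOTATION = "batch.kubernetes.io/job-completion-index"


def get_completion_index(annotations: Optional[Mapping[str, str]]) -> Optional[int]:
    """Return the pod's completion index, or None if it is missing or malformed."""
    if not annotations:
        return None
    raw = annotations.get(INDEX_ANNOTATION)
    if raw is None:
        return None
    try:
        return int(raw)
    except ValueError:
        # Treat a non-integer value the same as a missing annotation.
        return None
```

With the annotations from the log above (`{'kubernetes.io/psp': 'rafay-privileged-psp'}`), this returns `None`, which gives the executor a clear signal rather than leaving the workflow waiting.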