Kubernetes Jobs restart containers whose execution has failed
This could lead to a potentially infinite number of restarts of a container: for instance, its execution may fail because a dependency can no longer be met (e.g. a broken link).
A workflow is composed of a set of steps which are mapped to k8s Jobs. In terms of workflows, the expected behaviour is that if a step fails, its log is retrieved and the user gets a message saying that there was a problem in that specific step.
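For illustration, a minimal sketch of that step-to-Job mapping using the official Kubernetes Python client; the helper name `step_to_job` and the step values are hypothetical, not part of cap-reuse:

```python
# Hypothetical sketch: submit one workflow step as a k8s Job via the
# official Python client (names and values are illustrative only).
from kubernetes import client, config


def step_to_job(step_name, image, command):
    container = client.V1Container(name=step_name, image=image, command=command)
    template = client.V1PodTemplateSpec(
        spec=client.V1PodSpec(containers=[container], restart_policy="OnFailure")
    )
    return client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name=step_name),
        spec=client.V1JobSpec(template=template),
    )


config.load_kube_config()  # or load_incluster_config() when running in-cluster
client.BatchV1Api().create_namespaced_job(
    namespace="default",
    body=step_to_job("step-1", "busybox", ["sh", "-c", "make output"]),
)
```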
There is an open issue about this behaviour on the official Kubernetes repository.
Possible solution related to milestone 1: use the broker to detect whether a Job has at least one failed execution and kill that Job, so the "restart until completed" loop is stopped. Note that with restart policy OnFailure (which is mandatory, otherwise the Job would keep creating several new containers to run the task successfully) there will never be a Pod reported as failed, since the same Pod just restarts forever; the failure has to be detected from the container's restart count instead.
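A minimal sketch of that detection, assuming the broker talks to the cluster through the official Python client; `kill_job_on_failure` is a hypothetical helper, and `job-name` is the label the Job controller adds to its Pods by default:

```python
# Sketch of the broker-side workaround: if any container of the Job's
# Pods has restarted at least once, delete the Job so Kubernetes stops
# re-running it.
from kubernetes import client, config


def kill_job_on_failure(job_name, namespace="default"):
    core = client.CoreV1Api()
    batch = client.BatchV1Api()
    pods = core.list_namespaced_pod(
        namespace, label_selector="job-name={}".format(job_name)
    )
    for pod in pods.items:
        for cs in pod.status.container_statuses or []:
            # With restartPolicy OnFailure the Pod never reaches a Failed
            # phase; a non-zero restart count is the signal that at least
            # one execution has failed.
            if cs.restart_count >= 1:
                batch.delete_namespaced_job(
                    job_name,
                    namespace,
                    body=client.V1DeleteOptions(propagation_policy="Foreground"),
                )
                return True
    return False


config.load_kube_config()  # or load_incluster_config() when running in-cluster
kill_job_on_failure("step-1")
```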
An alternative solution could be to launch Pods (containers) directly without relying on Jobs. If this path is taken we should take care (in the Step Broker) of the Pods that could not be launched because, for instance, a resource quota limit has been reached.
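A sketch of this alternative, again assuming the official Python client; `launch_step_pod` is a hypothetical helper, and the 403 check illustrates the quota case the Step Broker would have to handle (for example by queueing the step and retrying later):

```python
# Sketch: create a bare Pod with restartPolicy Never and surface
# launch failures (e.g. exceeded resource quota) to the caller.
from kubernetes import client, config
from kubernetes.client.rest import ApiException


def launch_step_pod(step_name, image, command, namespace="default"):
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(name=step_name),
        spec=client.V1PodSpec(
            containers=[client.V1Container(name=step_name, image=image, command=command)],
            restart_policy="Never",  # no automatic restarts at all
        ),
    )
    try:
        return client.CoreV1Api().create_namespaced_pod(namespace, pod)
    except ApiException as exc:
        if exc.status == 403:
            # Typically "exceeded quota: ..."; the Step Broker should
            # queue the step and retry it later.
            return None
        raise


config.load_kube_config()
launch_step_pod("step-1", "busybox", ["sh", "-c", "exit 1"])
```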
https://github.com/diegodelemos/cap-reuse/milestone/1 fixes it. Remove this behaviour once k8s Jobs can be configured not to exceed a certain number of restarts.
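For reference, newer Kubernetes releases (around 1.8 onwards) expose `spec.backoffLimit` on Jobs, which is exactly this kind of restart cap; a sketch of what the removal could rely on, reusing the Job layout from the first sketch above:

```python
# Assumed sketch: once backoffLimit is available, the Job spec itself
# caps retries, making the broker-side kill unnecessary.
from kubernetes import client

spec = client.V1JobSpec(
    backoff_limit=3,  # mark the Job as failed after 3 retries
    template=client.V1PodTemplateSpec(
        spec=client.V1PodSpec(
            containers=[client.V1Container(name="step-1", image="busybox")],
            restart_policy="OnFailure",
        )
    ),
)
```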