diegodelemos/cap-reuse

Kubernetes Jobs restart containers whose execution has failed

Closed this issue · 4 comments

This can lead to a potentially infinite number of restarts of a container; for instance, its execution may keep failing because a dependency can no longer be met (e.g. a broken link).

A workflow is composed of a set of steps, each of which is mapped to a k8s Job. In terms of workflows, the expected behaviour is that if a step fails, its log is retrieved and the user gets a message saying that there was a problem in that specific step.

There is an open issue about this on the official Kubernetes repository.

Possible solution, related to milestone 1: use the broker to detect that a Job has at least one failed execution and kill that Job, so the "restart until completed" loop is stopped. Note that with restart policy `OnFailure` (mandatory, otherwise the Job keeps creating new containers until the task runs successfully) there will never be a Pod left in a failed state, since the same Pod restarts forever; the broker therefore has to detect the failure from the container restart count rather than from a failed Pod (see the sketch below).
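A minimal sketch of what that broker-side check could look like, assuming the official Kubernetes Python client; the function name `kill_failing_job` and the `default` namespace are hypothetical, and the `job-name` label is the one Kubernetes sets on Pods created by a Job:

```python
from kubernetes import client, config

config.load_kube_config()  # load_incluster_config() when running in-cluster
core = client.CoreV1Api()
batch = client.BatchV1Api()

def kill_failing_job(job_name, namespace="default"):
    """Kill a Job whose Pod has already failed at least once.

    Hypothetical step-broker helper: with restartPolicy OnFailure the
    same Pod restarts forever, so we look at the container restart
    count instead of waiting for a Pod in a failed state.
    """
    pods = core.list_namespaced_pod(
        namespace, label_selector="job-name={}".format(job_name))
    for pod in pods.items:
        for status in (pod.status.container_statuses or []):
            if status.restart_count >= 1:
                # Retrieve the log so the user can be told which step failed.
                log = core.read_namespaced_pod_log(pod.metadata.name, namespace)
                # Delete the Job to stop the restart loop; Foreground
                # propagation also removes the Job's Pods.
                batch.delete_namespaced_job(
                    job_name, namespace,
                    body=client.V1DeleteOptions(propagation_policy="Foreground"))
                return log
    return None
```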

Another solution could be to launch Pods (containers) directly, without relying on Jobs. If this path is taken, we should take care (on the Step Broker) of the Pods that could not be launched because, for instance, a resource quota limit has been met (see the sketch below).
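Again only a sketch, assuming the Kubernetes Python client: a ResourceQuota violation is rejected at admission time with HTTP 403, so the broker could catch that case and queue the step for a later retry. `launch_step_pod` and the retry policy are hypothetical:

```python
from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_kube_config()  # load_incluster_config() when running in-cluster
core = client.CoreV1Api()

def launch_step_pod(pod_manifest, namespace="default"):
    """Launch a workflow step as a bare Pod, with no Job wrapper.

    Hypothetical broker helper: returns the created Pod, or None when
    the API server rejected it (e.g. a ResourceQuota limit was met).
    """
    try:
        return core.create_namespaced_pod(namespace, pod_manifest)
    except ApiException as exc:
        if exc.status == 403:
            # ResourceQuota admission rejects the Pod with 403 Forbidden;
            # the broker should queue the step and retry it later.
            return None
        raise
```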

https://github.com/diegodelemos/cap-reuse/milestone/1 fixes it. Remove this behaviour once k8s Jobs can be configured not to exceed a certain number of restarts.