Pods that have failed should not be reaped
When the job/pod reaper runs, it removes all jobs where `job.spec.completions == job.status.succeeded` and all pods where `pod.status.phase == "Succeeded"`. Apparently this includes pods that were OOMKilled.
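The reaper's selection rules can be sketched like this in Python (the function names and dict shapes are assumptions for illustration, not the reaper's actual code):

```python
def job_is_complete(job):
    # Current job rule: a job is reaped once the number of succeeded
    # pods reaches spec.completions -- even if other pods created by
    # the job failed (e.g. were OOMKilled).
    return job["spec"]["completions"] == job["status"].get("succeeded", 0)

def pod_is_complete(pod):
    # Current pod rule: only pods in phase "Succeeded" are reaped directly.
    return pod["status"]["phase"] == "Succeeded"

# A job whose first pod was OOMKilled but whose retry succeeded still
# satisfies the job rule, so reaping the job deletes the failed pod too.
job = {"spec": {"completions": 1}, "status": {"succeeded": 1, "failed": 1}}
print(job_is_complete(job))  # True
```

This is why the OOMKilled pods in the output below disappear even though the pod rule never matches them: their parent jobs match the job rule.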
```
$ kubectl get pods -a
NAME                    READY     STATUS      RESTARTS   AGE
downloads-o8hbo-z8hh7   0/1       Completed   0          23m
downloads-tu1tv-pj5w4   0/1       Completed   0          23m
downloads-xnayd-dq7c4   0/1       OOMKilled   0          23m
downloads-ys1ro-kj07v   0/1       OOMKilled   0          23m

$ kubectl get jobs
NAME              DESIRED   SUCCESSFUL   AGE
downloads-o8hbo   1         1            23m
downloads-tu1tv   1         1            23m
downloads-xnayd   1         1            23m
downloads-ys1ro   1         1            23m
```
This is because the job is counted as "successful", and when the reaper removes the job it also removes its child pods. (From the Job docs: "When you delete the job using kubectl, all the pods it created are deleted too.")
We don't need to reap pods anymore (I believe we had to in older versions of k8s); we can rely on job reaping to clean them up.
However, we should keep the jobs that have failed pods around so that the logs can be inspected.
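A minimal sketch of that proposed rule, assuming the same hypothetical dict shapes as above (`failed` is the Job status field that counts failed pods):

```python
def job_should_be_reaped(job):
    # Proposed rule: reap a job only when all requested completions
    # succeeded AND none of its pods failed, so logs from OOMKilled
    # pods stay around for inspection.
    status = job["status"]
    completed = job["spec"]["completions"] == status.get("succeeded", 0)
    return completed and status.get("failed", 0) == 0

print(job_should_be_reaped(
    {"spec": {"completions": 1}, "status": {"succeeded": 1}}))               # True
print(job_should_be_reaped(
    {"spec": {"completions": 1}, "status": {"succeeded": 1, "failed": 1}}))  # False
```

Jobs held back this way would need some other expiry (e.g. an age cutoff), or failed jobs would accumulate forever.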
Note also that we set `restartPolicy` to `"OnFailure"` by default. The docs state that the container should be restarted if it was "killed for exceeding a memory limit", so I'm not really sure why these pods failed and were shut down.