keylimetoolbox/resque-kubernetes

Pods that have failed should not be reaped


When the job/pod reaper runs, it removes all jobs where job.spec.completions == job.status.succeeded and all pods where pod.status.phase == "Succeeded". Apparently this includes pods that were OOMKilled.

$ kubectl get pods -a
NAME                        READY     STATUS      RESTARTS   AGE
downloads-o8hbo-z8hh7       0/1       Completed   0          23m
downloads-tu1tv-pj5w4       0/1       Completed   0          23m
downloads-xnayd-dq7c4       0/1       OOMKilled   0          23m
downloads-ys1ro-kj07v       0/1       OOMKilled   0          23m

$ kubectl get jobs
NAME                DESIRED   SUCCESSFUL   AGE
downloads-o8hbo     1         1            23m
downloads-tu1tv     1         1            23m
downloads-xnayd     1         1            23m
downloads-ys1ro     1         1            23m

This is because the job is counted as "successful" and, when the reaper removes the job, it also removes its child pods. (From the Job docs: "When you delete the job using kubectl, all the pods it created are deleted too.")
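
To make the cascade concrete, the reaping pass amounts to something like the following (a rough sketch in kubeclient terms, not the gem's actual code; jobs_client, pods_client, and namespace are placeholder names):

require "kubeclient"

# Sketch only. Assume these are configured elsewhere, e.g.:
#   jobs_client = Kubeclient::Client.new("https://#{host}/apis/batch", "v1", auth_options: auth)
#   pods_client = Kubeclient::Client.new("https://#{host}/api",        "v1", auth_options: auth)
namespace = "default"

# Reap jobs whose requested completions have all succeeded. A job whose
# pod was OOMKilled can still reach this state, and deleting the job
# cascades to the pods it created -- removing the very pods we'd want
# to inspect.
jobs_client.get_jobs(namespace: namespace).each do |job|
  if job.spec.completions == job.status.succeeded
    jobs_client.delete_job(job.metadata.name, namespace)
  end
end

# Reap pods that finished in the "Succeeded" phase. Per the explanation
# above, it is the job deletion, not this check, that takes out the
# OOMKilled pods.
pods_client.get_pods(namespace: namespace).each do |pod|
  if pod.status.phase == "Succeeded"
    pods_client.delete_pod(pod.metadata.name, namespace)
  end
end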

We no longer need to reap pods ourselves (I believe we had to in older versions of k8s); we can rely on the job reaping to clean them up.

However, we should keep the jobs that have failed pods around so that the logs can be inspected.
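
One way to express that in the reaper (a sketch only: whether job.status.failed is actually populated when a pod is OOMKilled needs verifying, and the client and namespace names are the same placeholders as above):

# Proposed shape: no pod reaping at all, and a job is only deleted when
# every completion succeeded *and* no pod failures were recorded, so
# jobs with OOMKilled pods stay around for log inspection.
jobs_client.get_jobs(namespace: namespace).each do |job|
  fully_succeeded = job.spec.completions == job.status.succeeded
  had_failed_pods = job.status.failed.to_i > 0  # nil.to_i == 0 when the field is absent

  if fully_succeeded && !had_failed_pods
    jobs_client.delete_job(job.metadata.name, namespace)
  end
end

If status.failed turns out not to be set here (plausible, since restartPolicy: OnFailure restarts the container inside the same pod), the check would need to look at the pods' container termination reasons instead.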

Note also that we set restartPolicy to "OnFailure" by default. The docs state that it should restart the container if it was "killed for exceeding a memory limit", so I'm not really sure why these pods failed and were shut down instead of being restarted.
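
For reference, the relevant piece of the submitted Job manifest is just this (a hand-written fragment, not lifted from the gem; only the restartPolicy line matters here):

# Hypothetical manifest fragment. With "OnFailure" the kubelet is supposed
# to restart a container that exits non-zero or is OOMKilled inside the
# same pod, which is the behavior the docs describe.
job_manifest = {
  "spec" => {
    "template" => {
      "spec" => {
        "restartPolicy" => "OnFailure",
        "containers"    => [
          { "name" => "worker", "image" => "example/worker:latest" }
        ]
      }
    }
  }
}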