elixir-cloud-aai/TESK

Set job TTL or implement garbage collector

Opened this issue · 3 comments

Hey,

I am running long workflows that produce many tasks, and each task creates 4 jobs. This adds up to more than 300 jobs/pods, whose PVs/PVCs stay attached until those jobs/pods are removed. I could not find anything in the docs about garbage collection or setting a TTL for the created jobs. It would be nice if successful jobs were removed automatically, as their PVs take up a lot of space.

You are right.
There are two parts to that story.

  1. There is no Job TTL at the moment, nor Task deletion via the API, and I agree this needs to come. An alternative that has been discussed for a long time but never implemented is proper DB storage for task metadata, which would end the abuse of K8s objects for that purpose.
  2. PVC handling has actually changed. TESK used to delete PVCs at the end of the task, and it still does, but since one of the K8s versions PVCs carry finalisers, so they are not removed until their respective pods are. The easy change here would be to remove the finalisers programmatically via the K8s API, letting the PVCs go while keeping the Jobs/Pods for now.
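A minimal sketch of what point 2 could look like in Python, assuming the official `kubernetes` client. The helper names `finalizer_removal_patch` and `release_pvc` are hypothetical and not part of TESK; `core_api` is expected to behave like `kubernetes.client.CoreV1Api`, whose `patch_namespaced_persistent_volume_claim` method is the only real API call relied on here. Note that clearing finalisers bypasses the in-use protection Kubernetes puts on PVCs, so this should only be done once the task no longer needs the volume.

```python
# Hypothetical helpers, not TESK code: a sketch of clearing PVC finalisers.


def finalizer_removal_patch():
    """Return a patch body that clears metadata.finalizers.

    Setting the field to null (None) in a merge-style patch deletes it.
    """
    return {"metadata": {"finalizers": None}}


def release_pvc(core_api, name, namespace):
    """Clear the finalisers on one PVC so a pending delete can complete.

    core_api: an object shaped like kubernetes.client.CoreV1Api (assumption).
    """
    return core_api.patch_namespaced_persistent_volume_claim(
        name=name,
        namespace=namespace,
        body=finalizer_removal_patch(),
    )
```

With a real cluster this would be called as `release_pvc(kubernetes.client.CoreV1Api(), "<pvc-name>", "<namespace>")` after loading the cluster configuration.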

Hey @aniewielska, we raised the same issue on the tesk-api repository. In our case, we will use the "ttlSecondsAfterFinished" property on a Job, using the latest version of the Kubernetes Java client.
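For illustration, a small sketch of the TTL approach on a Job manifest expressed as a plain Python dict. The `with_ttl` helper and the example manifest are hypothetical; only the `spec.ttlSecondsAfterFinished` field of a `batch/v1` Job comes from Kubernetes itself. The actual tesk-api change would be made through the Java client instead.

```python
import copy


def with_ttl(job_manifest, ttl_seconds):
    """Return a copy of a Job manifest with spec.ttlSecondsAfterFinished set.

    The input manifest is left untouched.
    """
    if ttl_seconds < 0:
        raise ValueError("ttl_seconds must be non-negative")
    patched = copy.deepcopy(job_manifest)
    patched.setdefault("spec", {})["ttlSecondsAfterFinished"] = int(ttl_seconds)
    return patched


# Illustrative manifest; the name and image are placeholders.
example_job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "example-job"},
    "spec": {
        "template": {
            "spec": {
                "containers": [{"name": "main", "image": "busybox"}],
                "restartPolicy": "Never",
            }
        }
    },
}

# Ask Kubernetes to delete the Job (and its pods) one hour after it finishes.
job_with_ttl = with_ttl(example_job, 3600)
```

Once the TTL expires, the Job and its dependent pods are deleted by the cluster, so no separate cleanup process is needed for them.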

Did you plan to implement this enhancement on your side, or could we plan a collaboration?

Cheers!

@aniewielska

> PVC handling has actually changed. TESK used to delete PVCs at the end of the task, and it still does, but since one of the K8s versions PVCs carry finalisers, so they are not removed until their respective pods are. The easy change here would be to remove the finalisers programmatically via the K8s API, letting the PVCs go while keeping the Jobs/Pods for now.

I'm implementing a K8s CronJob that removes all pods created for a task so that the PVs are removed as well. The CronJob simply looks for TESK's completed jobs and deletes the pods created for them.
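The selection step of such a CronJob could look roughly like this. It is only a sketch: the function names are hypothetical, and the input is assumed to be the `items` list from `kubectl get jobs -o json`. It relies on two Kubernetes facts: a finished Job has a condition of type `Complete` with status `True`, and pods created by a Job carry the label `job-name=<job name>`.

```python
def is_complete(job):
    """True if the Job has a Complete condition with status "True"."""
    conditions = job.get("status", {}).get("conditions") or []
    return any(
        c.get("type") == "Complete" and c.get("status") == "True"
        for c in conditions
    )


def completed_job_names(jobs):
    """Names of all completed Jobs, in input order."""
    return [job["metadata"]["name"] for job in jobs if is_complete(job)]


def pod_selector(job_name):
    """Label selector matching the pods that a given Job created."""
    return "job-name=" + job_name
```

Each selector can then be passed to a pod deletion call, for example `kubectl delete pods -l <selector>`.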

I'm not sure, though, when it is safe to delete the pods. For example, for task-ea0cb881:

NAME                          COMPLETIONS   DURATION   AGE
task-ea0cb881                 1/1           2d1h       2d1h
task-ea0cb881-ex-00           1/1           41m        2d1h
task-ea0cb881-inputs-filer    1/1           3m34s      2d1h
task-ea0cb881-outputs-filer   1/1           17s        2d1h

Should I delete a pod as soon as its job is in the completed state, or should I just check whether task-ea0cb881 has finished?
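One conservative reading of the second option, sketched below, is to clean up a task only once every one of its jobs, including the top-level `task-<id>` job, is complete. This is an assumption, not something confirmed in this thread. The name pattern `task-` followed by 8 hex characters is inferred from the listing above, and `tasks_safe_to_clean` is a hypothetical helper.

```python
import re
from collections import defaultdict

# Assumed naming scheme: "task-<8 hex chars>" with an optional "-suffix".
TASK_NAME = re.compile(r"^(task-[0-9a-f]{8})(?:-.+)?$")


def tasks_safe_to_clean(job_complete):
    """Return task ids whose jobs are all complete.

    job_complete maps a job name to True if that job has completed.
    A task qualifies only if its top-level job is present and every
    job belonging to it is complete.
    """
    by_task = defaultdict(dict)
    for name, done in job_complete.items():
        match = TASK_NAME.match(name)
        if match:
            by_task[match.group(1)][name] = done
    return sorted(
        task_id
        for task_id, jobs in by_task.items()
        if task_id in jobs and all(jobs.values())
    )
```

Waiting for the whole task is slower to free space than deleting per job, but it avoids removing a pod whose volume a later step of the same task may still need.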