kubernetes-sigs/jobset

JobSet TTL to clean up completed workloads

danielvegamyhre opened this issue · 21 comments

We should look into having JobSet respect a TTL so completed JobSets can be deleted from etcd and not take up space.

https://kubernetes.io/docs/concepts/workloads/controllers/ttlafterfinished/

So I am a bit confused by this.

Are you saying we should add a field to JobSet that purges the JobSet from etcd after a given TTL?

Or do you want someone to test whether, if one sets a TTL on a Job, the Job finishes and then gets deleted, and whether the JobSet controller then recreates the finished Job because it was deleted, even though the JobSet was already marked as successful (when all Jobs finish)?

So individual Jobs can use ttlSecondsAfterFinished without any issue:

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: simple-with-ttl
spec:
  replicatedJobs:
  - name: leader
    replicas: 1
    template:
      spec:
        # Set backoff limit to 0 so job will immediately fail if any pod fails.
        backoffLimit: 0
        ttlSecondsAfterFinished: 60
        completions: 1
        parallelism: 1
        template:
          spec:
            containers:
            - name: leader
              image: bash:latest
              command:
              - bash
              - -xc
              - |
                sleep 100
  - name: workers
    replicas: 1
    template:
      spec:
        backoffLimit: 0
        ttlSecondsAfterFinished: 60
        completions: 2
        parallelism: 2
        template:
          spec:
            containers:
            - name: worker
              image: bash:latest
              command:
              - bash
              - -xc
              - |
                sleep 100

The Jobs are cleaned up and deleted from etcd, but the JobSet will still be marked as successful.

I think we could add a TTL for JobSet, since the JobSet itself is not deleted. IMO this would be a nice feature for any CRD, though.

Thanks for testing ttlSecondsAfterFinished for child Jobs! So to clarify, the jobset controller did not recreate the child Jobs that were cleaned up after the TTL? Just want to confirm, since I created this issue when a user mentioned that Jobs were recreated for them. They could have just been misunderstanding what they were seeing, though.

Yeah, I didn't see Jobs recreated in my case.

Sounds good, thanks for taking a look at this.

I think we actually need a TTL in JobSet to improve the user experience with Kueue integration. Right now, users have to manually delete all Workloads since JobSet has no TTL, so they either need to do it manually or have some process doing it periodically. If JobSet had a TTL, the JobSet + Workload object would be cleaned up automatically.
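For illustration, a JobSet-level TTL might look like the sketch below. Note that `spec.ttlSecondsAfterFinished` at the JobSet level is only the proposed field being discussed here, mirroring the existing batch/v1 Job field of the same name; it is a hypothetical example, not an existing API.

```yaml
# Hypothetical sketch of the proposed JobSet-level TTL field,
# mirroring the batch/v1 Job field ttlSecondsAfterFinished.
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: ttl-example
spec:
  # Proposed field: delete the JobSet 60s after it finishes.
  ttlSecondsAfterFinished: 60
  replicatedJobs:
  - name: workers
    replicas: 1
    template:
      spec:
        completions: 1
        parallelism: 1
        template:
          spec:
            containers:
            - name: worker
              image: bash:latest
              command:
              - bash
              - -c
              - sleep 10
```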

Can you clean up the description? I think this could be a good first issue if we clear the objectives.

Add a ttlSecondsAfterFinished field and delete finished JobSets after that time.

The question I would have is: do we delete the Jobs and Pods of finished JobSets? That would make debugging difficult, but if they finished, I guess we don't need them.

/help
/good-first-issue

@kannon92:
This request has been marked as suitable for new contributors.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-good-first-issue command.

In response to this:

/help
/good-first-issue

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

I have capacity to help with this issue if no one else is looking at it.

Can you clean up the description? I think this could be a good first issue if we clear the objectives.

Add a ttlSecondsAfterFinished field and delete finished JobSets after that time.

The question I would have is: do we delete the Jobs and Pods of finished JobSets? That would make debugging difficult, but if they finished, I guess we don't need them.

I guess we only need to agree on the final question from @kannon92, and then it should be a clear path to the finish line.

/assign

@danielvegamyhre what is your opinion on what Kevin asked about cleaning up Jobs and Pods after a JobSet finishes?

After the JobSet finishes and the TTL has been reached, we want the JobSet and its child resources (Jobs, Pods, service, etc.) to be deleted.

Thanks @danielvegamyhre, I am good to proceed :)

/retitle JobSet TTL to clean up completed workloads

I was testing this feature. I noticed that Pods and Jobs were getting deleted with the TTL, but I did not see JobSets getting deleted.

That is not good. Do you want to open an issue with any logs you noticed?

@dejanzele can we add an e2e test as well?

Sorry, I was mistaken.

I created the example TTL JobSet and noticed that the JobSet was still around while the Jobs/Pods were deleted.

But I tried testing again and did actually see the JobSet being deleted eventually. I'm not sure the deletions happen at the same time, but I do see the TTL working with the JobSet/Jobs/Pods.

@ahg-g ok, I'll create a PR