kubernetes-sigs/jobset

JobSetTemplate API

ahg-g opened this issue · 10 comments

ahg-g commented

What would you like to be added:
A JobSetTemplate API similar to PodTemplate.

Why is this needed:
APIs built on top of JobSet need to reference a JobSet spec. The common approach is to embed the JobSet spec inside the higher-level API, which makes it hard to validate; the alternative is to reference a template.
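To make the trade-off concrete, here is a minimal sketch of the two approaches using plain dicts. Everything in it is an assumption for illustration: `TemplateStore`, `resolve_jobset_spec`, the `jobSetSpec` and `jobSetTemplateRef` field names, and the toy validation rules are hypothetical and not part of the JobSet API.

```python
# Hypothetical sketch: none of these names exist in the JobSet API.

def validate_jobset_spec(spec):
    """Return a list of errors for a (heavily simplified) JobSet spec."""
    errors = []
    jobs = spec.get("replicatedJobs", [])
    if not jobs:
        errors.append("replicatedJobs must not be empty")
    names = [job.get("name") for job in jobs]
    if len(names) != len(set(names)):
        errors.append("replicatedJobs names must be unique")
    return errors


class TemplateStore:
    """Stand-in for the API server: a template is validated once, on create."""

    def __init__(self):
        self._templates = {}

    def create(self, name, spec):
        errors = validate_jobset_spec(spec)
        if errors:
            raise ValueError("; ".join(errors))
        self._templates[name] = spec

    def get(self, name):
        return self._templates[name]


def resolve_jobset_spec(parent, store):
    """Return the JobSet spec for a higher-level object.

    Embedded: the parent API has to validate the full JobSet spec itself.
    Referenced: the spec was already validated when the template was created.
    """
    if "jobSetSpec" in parent:
        spec = parent["jobSetSpec"]
        errors = validate_jobset_spec(spec)
        if errors:
            raise ValueError("; ".join(errors))
        return spec
    return store.get(parent["jobSetTemplateRef"]["name"])
```

With a reference, the higher-level API only has to check that the named template exists; the JobSet-specific validation lives in one place.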

ahg-g commented

/feature

/kind feature

Hello, I want to share some simple ideas. I don’t know if they are what we need.

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSetTemplate
metadata:
  name: my-jobset-template
spec:
  failurePolicy:
    maxRestarts: 3
  replicatedJobs:
    - name: workers
      replicas: 1
      template:
        spec:
          backoffLimit: 0
          completions: 2
          parallelism: 2
          template:
            spec:
              containers:
                - name: worker
                  image: bash:latest
                  command:
                    - bash
                    - -xc
                    - |
                      sleep 1000
---
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: my-jobset
spec:
  templateRef:
    name: my-jobset-template

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: paralleljobs
spec:
  replicatedJobs:
    - name: workers
      templateRef: my-jobset-template
    - name: driver
      templateRef: my-jobset-template
---
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSetTemplate
metadata:
  name: my-jobset-template
spec:
  replicas: 3
  template:
    spec:
      parallelism: 1
      completions: 1
      backoffLimit: 0
      template:
        spec:
          containers:
            - name: sleep
              image: busybox
              command:
                - sleep
              args:
                - 100s

If this approach is correct, perhaps we need another CR object and a controller to manage it.
I'm sorry if I misunderstood. Please forgive me if I got it wrong.

@ahg-g @danielvegamyhre @kannon92 Could you please check whether my understanding is correct? If so, I will take this on when I have time and write a KEP design document.

I’d look at how CronJob uses JobTemplates or even how JobSet uses a JobTemplate.

A user should create a JobSet without using the templates.

TrainJob could specify a template, and that template would be used to create a JobSet. I think that’s the flow.

Generally the templates are used if someone wants to compose the object.
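As a rough illustration of that flow, the sketch below stamps a JobSet manifest out of a template on behalf of a parent object, much as CronJob creates Jobs from its `jobTemplate`. The function name `jobset_from_template`, the template layout (`metadata` plus `spec`), and the parent's `example.com/v1` API group are assumptions made up for this example; only the `ownerReferences` field names and the JobSet `apiVersion`/`kind` come from Kubernetes and the manifests in this thread.

```python
import copy


def jobset_from_template(template, name, owner):
    """Build a JobSet manifest from a template for a parent object (sketch)."""
    template_meta = template.get("metadata", {})
    return {
        "apiVersion": "jobset.x-k8s.io/v1alpha2",
        "kind": "JobSet",
        "metadata": {
            "name": name,
            # Carry over labels from the template, if any.
            "labels": dict(template_meta.get("labels", {})),
            # Mark the parent as the controlling owner of the JobSet.
            "ownerReferences": [
                {
                    "apiVersion": owner["apiVersion"],
                    "kind": owner["kind"],
                    "name": owner["metadata"]["name"],
                    "uid": owner["metadata"]["uid"],
                    "controller": True,
                }
            ],
        },
        # Deep copy so later edits to the JobSet never leak into the template.
        "spec": copy.deepcopy(template["spec"]),
    }


# Hypothetical parent object; the API group and uid are placeholders.
train_job = {
    "apiVersion": "example.com/v1",
    "kind": "TrainJob",
    "metadata": {"name": "my-train-job", "uid": "0000-placeholder"},
}
template = {
    "metadata": {"labels": {"app": "demo"}},
    "spec": {"replicatedJobs": [{"name": "workers", "replicas": 1}]},
}
jobset = jobset_from_template(template, "my-train-job", train_job)
```

In this model the template is never run on its own; only the parent's controller turns it into a JobSet, and users who do not need composition keep creating JobSets directly.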

Perhaps we can create a JobSetTemplateController to manage objects like JobSetTemplate. A JobSetTemplate holds only template metadata, and JobSet objects can reference it. But I'm not sure if this is a good design.

According to this proposal: kubeflow/training-operator#2171, we are planning to create TrainingRuntime and ClusterTrainingRuntime to represent blueprints for various ML training or HPC configurations.
For LLM runtimes, we will support a list of different templates to fine-tune open-source foundation models.

Since we use the JobSet API directly in the TrainingRuntime, I am wondering whether we still need JobSetTemplates.

As I understand it, @ahg-g mentioned that he wants to try supporting this JobSetTemplate feature regardless of Training Operator v2.

Yes, we have another use case where JobSetTemplate would be useful - I can't elaborate much further right now since it isn't public yet, but there are definitely other use cases :)
