kubernetes-sigs/jobset

Graduate the API to v1

Opened this issue · 18 comments

Graduate the JobSet API to v1. We need to keep v1alpha1 around for a few more releases to make it easier for customers to migrate.

Ref: https://book.kubebuilder.io/multiversion-tutorial/api-changes
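Mechanically, serving both versions during the migration window would look roughly like this in the CRD manifest (an illustrative sketch, not the actual JobSet CRD; schemas and webhook config are elided):

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: jobsets.jobset.x-k8s.io
spec:
  group: jobset.x-k8s.io
  names:
    kind: JobSet
    plural: jobsets
  scope: Namespaced
  versions:
    - name: v1alpha1
      served: true     # still served so existing clients keep working
      storage: false   # no longer the storage version
      deprecated: true
      deprecationWarning: "jobset.x-k8s.io/v1alpha1 is deprecated; use v1"
      schema: {}       # OpenAPI schema elided
    - name: v1
      served: true
      storage: true    # objects are now persisted as v1
      schema: {}       # OpenAPI schema elided
  conversion:
    strategy: Webhook  # converts between v1alpha1 and v1 on the fly
```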

@ahg-g and I were discussing skipping v1beta1 and graduating directly to v1, since several companies have been using it for their production training workloads for a few months now, so it is effectively GA already.

@kannon92 @vsoch what do you think?

It depends on if you think it's really free of errors and potential issues, or not. I don't see any harm in doing v1beta1 and then having that wiggle room, but it's up to you!

I think the main thing about v1 promotion is that it means we should not break any existing user. I think we have been pretty careful about not breaking the API anymore, so I think it's fine to promote to GA.

Could you ask those companies (or even you!) to create an adopters page? It'd be nice to convey that people are actually using this project for something.

I created #398 for the adopters page actually

@vsoch it is not about being free of errors; this is more about the commitment we are making to API stability. Since multiple users already depend on it, this is practically GA: everything we do moving forward must be backward compatible, so we might as well make that commitment official in the API.
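To make that commitment concrete: once v1 becomes the storage version, anything expressed as v1alpha1 must convert to v1 and back without loss. A toy round-trip illustration (hypothetical, heavily trimmed types; the real JobSet API and its kubebuilder conversion hooks are far richer than this):

```go
package main

import "fmt"

// Hypothetical, heavily trimmed v1alpha1 shape.
type JobSetV1alpha1 struct {
	Replicas int32
	Suspend  bool
}

// The v1 shape keeps every v1alpha1 field so conversion is lossless.
type JobSetV1 struct {
	Replicas int32
	Suspend  bool
}

// toV1 converts a stored/received v1alpha1 object to v1.
func toV1(in JobSetV1alpha1) JobSetV1 {
	return JobSetV1{Replicas: in.Replicas, Suspend: in.Suspend}
}

// toV1alpha1 converts back, so old clients still see a faithful object.
func toV1alpha1(in JobSetV1) JobSetV1alpha1 {
	return JobSetV1alpha1{Replicas: in.Replicas, Suspend: in.Suspend}
}

func main() {
	orig := JobSetV1alpha1{Replicas: 4, Suspend: true}
	roundTripped := toV1alpha1(toV1(orig))
	fmt.Println(orig == roundTripped) // prints true: lossless round-trip
}
```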

> Could you ask those companies (or even you!) to create an adopters page? It'd be nice to convey that people are actually using this project for something.

we can mention that Google Cloud is using it (we are not at liberty to list the customers); @vsoch if you feel comfortable perhaps we can list Lawrence Livermore National Laboratory?

I will ask! To be clear - "using it" meaning for development and prototyping or in production? We do not have a production Kubernetes cluster. That's what we are working towards.

/assign

Starting this: #518

@ahg-g @danielvegamyhre from the Kubeflow discussions, do we want to table this?

Yes, I think so.

Based on our recent conversations, let's have a chat on the next Batch WG and Kubeflow Training WG calls to define action items. It would be nice to identify the list of pending APIs for JobSet v1 for various ML training/fine-tuning use-cases (e.g. PodGroups, Elastic Jobs, Stateful Indexed Jobs, etc.).
We can discuss short- and long-term goals, and gradually start working on them.
cc @tenzen-y @johnugeorge @terrytangyuan

@ahg-g @andreyvelich can you clarify what aspect of the recent discussions led you to want to pause graduation to v1? I've had to spend a lot of time on an internal project lately and I think I missed some of the latest conversations.

> @ahg-g @andreyvelich can you clarify what aspect of the recent discussions led you to want to pause graduation to v1? I've been having to spend a lot of time on an internal project lately and missed some of the latest conversations I think.

A few examples:

  1. Support elastic policy to create HPA for PyTorch elastic.
  2. JobSet doesn't have Restarting and Running conditions.
  3. @ahg-g proposed introducing a JobSetTemplate that we can deploy together with the Training Operator to simplify submission of distributed PyTorch jobs, so users don't need to understand how to configure environment variables in JobSet.
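For context on point 3, a plain torch.distributed env:// rendezvous needs variables like MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and RANK wired into each pod; a JobSetTemplate would hide plumbing along these lines (an illustrative fragment with made-up names, not a real JobSet manifest):

```yaml
# Illustrative container env for torch.distributed init (env:// rendezvous).
env:
  - name: MASTER_ADDR
    value: myjobset-leader-0.myjobset   # headless-service DNS of rank 0 (hypothetical name)
  - name: MASTER_PORT
    value: "29500"
  - name: WORLD_SIZE
    value: "8"                          # total number of workers
  - name: RANK                          # per-worker rank from the Indexed Job completion index
    valueFrom:
      fieldRef:
        fieldPath: metadata.annotations['batch.kubernetes.io/job-completion-index']
```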

As we discussed in this thread: https://docs.google.com/document/d/1C2ev7yRbnMTlQWbQCfX7BCCHcLIAGW2MP9f7YeKl2Ck/edit?disco=AAABKU-uQyA
since we want to make the JobSet API stable in v1, it would be nice to prototype production use-cases for JAX, MPI, or PyTorch to understand whether we need to make any changes to the JobSet API.

I am happy to discuss it on the next Batch WG call tomorrow if we have time for it. cc @bigsur0

Thanks @andreyvelich, who can help create tracking issues for the first two that describe the requirements in more detail? @tenzen-y?

> Thanks @andreyvelich, who can help create tracking issues for the first two that describe the requirements in more details? @tenzen-y ?

Thank you for mentioning me. Yes, I can help create those issues.
Let me summarize the context and requirements.

@andreyvelich @ahg-g I have created the dedicated issues for these:

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

ahg-g commented

/remove-lifecycle rotten