kubernetes-sigs/jobset
JobSet: a k8s native API for distributed ML training and HPC workloads
Go · Apache-2.0
Issues
JobSetTemplate API
#573 opened by ahg-g - 11
Support for the execution policy API in JobSet
#672 opened by andreyvelich - 0
Add Examples for Failure Policy Actions
#600 opened by jedwins1998 - 0
Add global replica count label/annotation to support multislice training workloads with different templates
#676 opened by GiuseppeTT - 11
Support Autoscaling replicatedJob
#570 opened by tenzen-y - 8
Active jobs are not deleted when the job finishes if TTLSecondsAfterFinished is set
#666 opened by CecileRobertMichon - 3
Register labels/annotations on Kubernetes Website
#639 opened by kannon92 - 9
Release v0.6.0 requirements
#523 opened by danielvegamyhre - 6
Release v0.6.0
#655 opened by danielvegamyhre - 0
Add global Job index label/annotation to provide a global index for each job across the entire JobSet
#649 opened by danielvegamyhre - 9
Jobset minMember support
#621 opened by song-william - 1
feature: Topology Domain with JobSet
#637 opened by googs1025 - 4
Allow to mutate PodTemplate when suspending a JobSet and support resuming such JobSet
#624 opened by mimowo - 2
add monitoring metrics for jobset
#613 opened by googs1025 - 1
docs: added use cases for using prometheus-operator
#626 opened by googs1025 - 13
Support Running condition
#571 opened by tenzen-y - 3
Support Stateful JobSet
#572 opened by tenzen-y - 10
Publish JobSet API reference
#566 opened by mimowo - 13
feat: add JobIndex to container env
#592 opened by googs1025 - 25
Wait for the webhook service to be listening before advertising the Jobset replica as ready.
#607 opened by mbobrovskyi - 2
MASTER_ADDR setting for mnist example
#603 opened by song-william - 2
Fix Bugs related to Configurable Failure Policy
#588 opened by jedwins1998 - 1
update kind e2e test version in jobset
#587 opened by googs1025 - 4
The e2e tests on 1.28 fail consistently
#583 opened by mimowo - 5
Add Job name label to pods
#578 opened by danielvegamyhre - 9
Add support for feature gates
#556 opened by danielvegamyhre - 8
Periodic Jobs for Testing Release Branches
#524 opened by kannon92 - 3
CustomResourceDefinition too long
#536 opened by kyle-google - 4
Add integration test for changes in PR #562
#563 opened by danielvegamyhre - 4
JobSet controller should not reconcile JobSets with deletion timestamp set (bug when deleting JobSets using foreground cascading deletion policy)
#561 opened by danielvegamyhre - 3
[bug] Reconciler error log, "the object has been modified; please apply your changes to the latest version and try again"
#555 opened by googs1025 - 4
Configurable Failure Policy KEP Has a Typo
#538 opened by jedwins1998 - 2
Bug in testing utility function JobSetActive
#530 opened by danielvegamyhre - 1
Move from using openapi-gen
#514 opened by kannon92 - 3
Release v0.5.0
#515 opened by danielvegamyhre - 1
Add TTL example to docsite
#519 opened by danielvegamyhre