[Sandbox] HAMi
Application contact emails
xiaozhang0210@hotmail.com,limengxuan@4paradigm.com
Project Summary
Heterogeneous AI Computing Virtualization Middleware (HAMi) is an "all-in-one" tool designed to manage heterogeneous AI computing devices in a Kubernetes cluster.
Project Description
Heterogeneous AI Computing Virtualization Middleware (HAMi) is an "all-in-one" tool designed to manage Heterogeneous AI Computing Devices in a k8s cluster. It includes everything you would expect, such as:
- Heterogeneous AI computing device support: currently NVIDIA, Cambricon, Hygon, Huawei Ascend, and Iluvatar.
- Device sharing: Each task can allocate a portion of a device instead of the entire device, allowing a device to be shared among multiple tasks.
- Device Memory Control: Devices can be allocated a specific device memory size (e.g., 3000M) or a percentage of the whole GPU's memory (e.g., 50%), ensuring it does not exceed the specified boundaries.
- Device Type Specification: You can specify the type of device to use or avoid for a particular task by setting annotations, such as `nvidia.com/use-gputype` or `nvidia.com/nouse-gputype`.
- Device UUID Specification: You can specify the UUID of the device to use or avoid for a particular task by setting annotations, such as `nvidia.com/use-gpuuuid` or `nvidia.com/nouse-gpuuuid`.
- Task priority: Tasks that share the same AI computing device can define different priorities. When resources are preempted, high-priority tasks receive higher QoS.
- CUDA unified memory: When GPU memory is insufficient, node memory can be used as an extension.
- Easy to use: You don't need to modify your task YAML to use our scheduler. All your jobs will be automatically supported after installation. Additionally, you can specify a resource name other than `nvidia.com/gpu` if you prefer.
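As a rough illustration of the annotation-based controls listed above, the following sketch builds a Pod manifest that asks for a slice of one GPU. The annotation keys and `nvidia.com/gpu` are taken from the list; the memory resource name `nvidia.com/gpumem`, the comma-separated value format, the image name, and the helper `build_pod_manifest` are assumptions made for the example and should be checked against the HAMi documentation.

```python
import json

# Annotation keys quoted from the feature list above.
USE_GPUTYPE = "nvidia.com/use-gputype"
NOUSE_GPUTYPE = "nvidia.com/nouse-gputype"

# "nvidia.com/gpu" appears in the text above.
GPU_COUNT = "nvidia.com/gpu"
# ASSUMPTION: illustrative resource name for device memory; verify in the HAMi docs.
GPU_MEM = "nvidia.com/gpumem"


def build_pod_manifest(name, image, gpu_mem, use_types=(), nouse_types=()):
    """Hypothetical helper: return a Pod manifest (dict) requesting part of one GPU."""
    annotations = {}
    # ASSUMPTION: multiple device types are passed as a comma-separated string.
    if use_types:
        annotations[USE_GPUTYPE] = ",".join(use_types)
    if nouse_types:
        annotations[NOUSE_GPUTYPE] = ",".join(nouse_types)
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "annotations": annotations},
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "resources": {"limits": {GPU_COUNT: 1, GPU_MEM: gpu_mem}},
                }
            ]
        },
    }


manifest = build_pod_manifest(
    "gpu-demo",
    "example.com/demo:latest",  # hypothetical image
    3000,
    use_types=["A100"],
    nouse_types=["1080"],
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be piped to `kubectl apply -f -`, since Kubernetes accepts manifests in JSON as well as YAML.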
The core features of HAMi are as follows
- Hard Limit on Device Memory.
- Allows partial device allocation by specifying device memory.
- Imposes a hard limit on streaming multiprocessors.
- Flexible binpack and spread scheduling policies based on GPU device and node.
- Permits partial device allocation by specifying device core usage.
- Requires zero changes to existing programs.
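To make the memory-based partial allocation concrete, here is a minimal sketch of the bookkeeping involved: turning a request such as `3000M` or `50%` into a memory budget and checking whether it still fits on a device. The function names are hypothetical, and treating `M` as MiB is an assumption. The sketch only illustrates the arithmetic; it says nothing about how HAMi enforces the limit at runtime.

```python
def resolve_memory_mib(request: str, device_total_mib: int) -> int:
    """Turn a request like '3000M' or '50%' into a budget in MiB (assumed unit)."""
    text = request.strip()
    if text.endswith("%"):
        percent = float(text[:-1])
        if not 0 < percent <= 100:
            raise ValueError(f"percentage out of range: {request!r}")
        return int(device_total_mib * percent / 100)
    if text.upper().endswith("M"):
        text = text[:-1]
    mib = int(text)
    if mib <= 0 or mib > device_total_mib:
        raise ValueError(f"request does not fit the device: {request!r}")
    return mib


def fits_on_device(request: str, device_total_mib: int, allocated_mib: int) -> bool:
    """A task is placed only if its budget fits in the memory not yet allocated."""
    budget = resolve_memory_mib(request, device_total_mib)
    return budget <= device_total_mib - allocated_mib


total = 16384  # hypothetical 16 GiB device
print(resolve_memory_mib("3000M", total))
print(resolve_memory_mib("50%", total))
print(fits_on_device("50%", total, allocated_mib=10000))
print(fits_on_device("3000M", total, allocated_mib=10000))
```

With 10000 MiB already allocated on the hypothetical device, the 50% request is rejected while the 3000M request still fits, which is the behavior that lets several small tasks share one device.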
The HAMi architecture is as follows
Application Scenarios
- Device sharing (or device virtualization) on Kubernetes.
- Scenarios where pods need to be allocated with specific device memory
- Need to balance GPU usage in a cluster with multiple GPU nodes.
- Low utilization of device memory and computing units, such as running 10 TensorFlow servings on one GPU.
- Situations that require a large number of small GPUs, such as teaching scenarios where one GPU is provided for multiple students to use, and cloud platforms that offer small GPU instances.
Org repo URL (provide if all repos under the org are in scope of the application)
https://github.com/Project-HAMi
Project repo URL in scope of application
Core repo: https://github.com/Project-HAMi/HAMi
Plus the corresponding public repos under https://github.com/Project-HAMi/
Additional repos in scope of the application
No response
Website URL
Roadmap
https://github.com/Project-HAMi/HAMi?tab=readme-ov-file#roadmap
Roadmap context
Product | Manufacturer | Memory isolation | Core isolation | Multi-card support |
---|---|---|---|---|
GPU | NVIDIA | ✅ | ✅ | ✅ |
MLU | Cambricon | ✅ | ❌ | ❌ |
DCU | Hygon | ✅ | ✅ | ❌ |
Ascend | Huawei | In progress | In progress | ❌ |
GPU | iluvatar | In progress | In progress | ❌ |
DPU | Teco | In progress | In progress | ❌ |
- Support video codec processing
- Support Multi-Instance GPUs (MIG)
- Support flexible scheduling policies
  - binpack
  - spread
  - NUMA affinity
- Integrated gpu-operator
- Rich observability support
- DRA Support
- Support Intel GPU device
- Support AMD GPU device
Contributing Guide
https://github.com/Project-HAMi/HAMi/blob/master/CONTRIBUTING.md
Here are our community meeting minutes
https://docs.google.com/document/d/1YC6hco03_oXbF9IOUPJ29VWEddmITIKIfSmBX8JtGBw/edit?usp=sharing
Code of Conduct (CoC)
https://github.com/Project-HAMi/HAMi/blob/master/CODE_OF_CONDUCT.md
Adopters
We ran a survey and found that dozens of adopters are already using HAMi. We will maintain the adopter list in the HAMi documentation later. Online survey results
Contributing or Sponsoring Org
4Paradigm, DaoCloud, HuaweiCloud, Rise Union
Maintainers file
https://github.com/Project-HAMi/HAMi/blob/master/MAINTAINERS.md
IP Policy
- If the project is accepted, I agree the project will follow the CNCF IP Policy
Trademark and accounts
- If the project is accepted, I agree to donate all project trademarks and accounts to the CNCF
Why CNCF?
The CNCF is the premier organization for cloud-native technologies and is backed by many leading companies in the industry. It also provides a platform for collaboration and community-building, which can lead to increased visibility, adoption, and contributions to HAMi.
At the same time, HAMi can be combined with other CNCF projects (such as Volcano, KubeRay, and Kueue) to provide a one-stop service for AI infrastructure.
Benefit to the Landscape
As AI becomes more popular, a growing range of AI devices is emerging. NVIDIA is the most prominent vendor, but many others are also actively embracing Kubernetes and the CNCF. Giving these GPUs, NPUs, and other devices a consistent user experience on one platform is therefore particularly important, and this is exactly what HAMi focuses on. With HAMi, managing and operating these GPUs and NPUs on Kubernetes becomes much simpler, and the application layer does not need to be aware of differences in the underlying hardware.
Cloud Native 'Fit'
HAMi is built using cloud native technology. It currently uses scheduler-plugin, webhook, device-plugin, and other technologies to manage and schedule heterogeneous AI computing devices. In the future, it will consider using DRA to optimize the architecture.
Cloud Native 'Integration'
HAMi draws on part of the source code of the NVIDIA device-plugin project to support basic NVIDIA GPU features. On top of this, we support the following extensions for NVIDIA GPUs:
- Device sharing: Each task can allocate a portion of a device instead of the entire device, allowing a device to be shared among multiple tasks.
- Device Memory Control: Devices can be allocated a specific device memory size (e.g., 3000M) or a percentage of the whole GPU's memory (e.g., 50%), ensuring it does not exceed the specified boundaries.
- Device Type Specification: You can specify the type of device to use or avoid for a particular task by setting annotations, such as `nvidia.com/use-gputype` or `nvidia.com/nouse-gputype`.
- Device UUID Specification: You can specify the UUID of the device to use or avoid for a particular task by setting annotations, such as `nvidia.com/use-gpuuuid` or `nvidia.com/nouse-gpuuuid`.
- HAMi provides scheduling enhancements based on kube-scheduler and supports binpack and spread policies at both the node and GPU device levels.
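The difference between binpack and spread at the node and device levels can be sketched as follows. This is a simplified illustration under stated assumptions, not HAMi's actual scoring logic: the `Device` class, the `choose` and `pick_node_and_device` helpers, and the use of free device memory as the only ranking criterion are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    total_mib: int
    used_mib: int

    @property
    def free_mib(self) -> int:
        return self.total_mib - self.used_mib


def choose(candidates, free_of, policy):
    """binpack prefers the least free capacity, spread prefers the most."""
    if policy == "binpack":
        return min(candidates, key=free_of)
    if policy == "spread":
        return max(candidates, key=free_of)
    raise ValueError(f"unknown policy: {policy!r}")


def pick_node_and_device(nodes, request_mib, node_policy, device_policy):
    """Pick a node first, then a device on it; return (node, device) or None."""
    # Keep only nodes that have at least one device with enough free memory.
    fitting = [
        (name, [d for d in devices if d.free_mib >= request_mib])
        for name, devices in nodes.items()
    ]
    fitting = [(name, devices) for name, devices in fitting if devices]
    if not fitting:
        return None
    # Node level: rank nodes by the total free memory across all their devices.
    node_name, devices = choose(
        fitting, lambda item: sum(d.free_mib for d in nodes[item[0]]), node_policy
    )
    # Device level: rank the fitting devices on the chosen node.
    device = choose(devices, lambda d: d.free_mib, device_policy)
    return node_name, device.name


nodes = {
    "node-a": [Device("a0", 16000, 12000), Device("a1", 16000, 2000)],
    "node-b": [Device("b0", 16000, 0), Device("b1", 16000, 0)],
}
print(pick_node_and_device(nodes, 3000, "binpack", "binpack"))
print(pick_node_and_device(nodes, 3000, "spread", "spread"))
print(pick_node_and_device(nodes, 3000, "binpack", "spread"))
```

Keeping the node policy and the device policy as separate arguments shows why the two levels matter: packing onto a busy node while spreading across its devices gives a different placement than packing at both levels.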
Cloud Native Overlap
We do not think there is direct overlap with other CNCF projects at this time. However, we do touch on some areas that other projects are investigating, namely device plugins and scheduler enhancements.
Volcano also provides the ability to share GPUs. In v1.8, the volcano-vgpu features were contributed to the Volcano repo by a HAMi maintainer. However, after discussions with the maintainer of Volcano, and in order to support the independent development of the HAMi community, it was decided to release it in v1.9. Later, this functionality was transferred to the HAMi project and is now maintained by the HAMi community (repo: https://github.com/Project-HAMi/volcano-vgpu-device-plugin).
Similar projects
Some comparisons between HAMi and similar projects
Highlights
- nvidia-device-plugin and k8s-dra-driver only support NVIDIA devices and do not support other heterogeneous AI computing devices.
- nvidia-device-plugin and k8s-dra-driver focus on integrating GPUs with Kubernetes, and do not focus on scheduling enhancements or rich observability metrics.
Comparison of GPU sharing solutions
Landscape
yes
HAMi is in the landscape and also in the cnai group
https://landscape.cncf.io/?group=cnai
Business Product or Service to Project separation
N/A
Project presentations
No response
Project champions
No response
Additional information
No response
TAG-Runtime
- "Project repo URL in scope of application" lists just the main repo; are the other repos out of scope for donation?
- Is the k8s-dra-driver fork for convenience, or is it really going to be a fork?
All public repos are in scope for donation.
k8s-dra-driver was forked for convenience; we plan to build our own DRA driver.
We've been exploring the combination of HAMi and DRA, and it is currently on the roadmap as well.
@raravena80 has TAG Runtime reviewed this project, and does it have a recommendation for the TOC?
They presented on May 16th, 2024.
Info:
- Slides: https://docs.google.com/presentation/d/1T-fJ-hCIiAlPsnG4WpzEMLGIL3PTIeHE5Lzv-HOEroQ/edit#slide=id.g2cd23b2285d_0_2
- Video of presentation: https://youtu.be/SwdEoQYkMsE
TAG-Runtime is good with the project going to Sandbox, provided they fulfill the CNCF Sandbox admission checklist.
Hami project review: https://docs.google.com/document/d/1Lb4HYnJR21AEsNGurtXcXqEzdrKu95cG0NziufGEI0c/edit
FYI @angellk @raravena80 @srust @rajaskakodkar
Some feedback is still needed from the authors in the doc for completeness.
Thank you very much. I have made comments in the document and look forward to your reply.
Thanks again for the very detailed and high-quality review of HAMi. I have replied to all the comments. If you have any questions, please leave a message.
I would like to clarify a few points.
1. Risk of single vendor contribution.
Because contributions were previously made in a non-standard way (direct commits, no PRs), the statistics are inaccurate. At present, DaoCloud and 4Paradigm have contributed similar amounts.
These are the current contributor statistics: https://github.com/Project-HAMi/HAMi/graphs/contributors?from=2021-07-04&to=2024-08-09&type=c
The top eight contributors (sorted by commits) come from four different vendors: 4Paradigm, DaoCloud, SAP, and NIVIC.
@archlitchi 4paradigm
@wawa0210 DaoCloud
@peizhaoyou 4paradigm
@lengrongfu DaoCloud
@chaunceyjiang DaoCloud
@CoderTH DaoCloud
@haitwang-cloud SAP
@whybeyoung NIVIC
@gsakun independent
Therefore, I believe there is no single-vendor contribution risk.
Of course, we will standardize the contribution process and look for more contributors in the future.
Thanks @wawa0210, I have incorporated your comments, and amended the context. Thank you for your collaboration and swift responses on the review.
TAG Contributor strategy has reviewed this project and found the following:
- The contributor guide is very basic, particularly as it does not cover the current actual contributor process (as mentioned upthread).
- HAMi does not have written governance, yet.
- The roadmap is a brief checklist in the project README, mainly focused on future devices and device features to support. It appears to have been updated a few times over the last year.
- There are three maintainers, who work for DaoCloud, HuaweiCloud, and 4Paradigm. As previously noted, contributor numbers may be misleading, but GitHub shows 24.
- Community meetings are held in Chinese, and appear to go back more than a year. While agendas are public, we found no public notes or recordings.
This review is for the TOC’s information only. Sandbox projects are not required to have full governance or contributor documentation.
After discussion with HAMi maintainers, we added a governance document, https://github.com/Project-HAMi/HAMi?tab=readme-ov-file#governance
There are three maintainers, who work for DaoCloud, HuaweiCloud, and 4Paradigm. As previously noted, contributor numbers may be misleading, but GitHub shows 24.
HAMi has three maintainers, and eleven community members
Community meetings are held in Chinese, and appear to go back more than a year. While agendas are public, we found no public notes or recordings.
We currently have a weekly community meeting in Chinese (this is our calendar). There is also a developer WeChat group, which currently has 137 members. Public meeting minutes and screen recordings are indeed missing, and this needs to be improved. We also need to pay attention to internationalization.
Yeah, that's challenging. But, if your contributors speak Chinese, that makes sense for your meetings. And if you can get meeting notes up in Chinese, other folks can use Google Translate. For that reason, notes are better than recordings.
If you get accepted into the CNCF, you'll want to eventually cultivate a second, English-speaking community as well as your Chinese one.
Regarding cloud native overlap, to elaborate further: Volcano and HAMi each concentrate on distinct aspects, and the two projects collaborate closely. Taking GPU sharing as an example, Volcano offers policy-based scheduling of GPU virtualization resources, while HAMi provides isolation of GPU memory and cores on the node. The combination of the two projects has been adopted by a number of users and has received great feedback.
/vote
Vote created
@mrbobbytables has called for a vote on [Sandbox] HAMi (#97).
The members of the following teams have binding votes:
Team |
---|
@cncf/cncf-toc |
Non-binding votes are also appreciated as a sign of support!
How to vote
You can cast your vote by reacting to this comment. The following reactions are supported:
In favor | Against | Abstain |
---|---|---|
👍 | 👎 | 👀 |
Please note that voting for multiple options is not allowed and those votes won't be counted.
The vote will be open for 2months 30days 2h 52m 48s. It will pass if at least 66% of the users with binding votes vote In favor 👍. Once it's closed, results will be published here as a new comment.
The TOC would also like the project to engage with the following Kubernetes groups in addition to completing the recommendations from the TAG:
- SIG Node,
- SIG Scheduling,
- Batch WG,
- Device Management WG
Thank you very much for the reminder. As it happens, KubeCon HK starts on August 21st, and the HAMi maintainers will attend. We will actively reach out to the people in these groups, listen to their suggestions for HAMi's future, and enrich the roadmap.
/check-vote
Vote status
So far 36.36% of the users with binding vote are in favor (passing threshold: 66%).
Summary
In favor | Against | Abstain | Not voted |
---|---|---|---|
4 | 0 | 0 | 7 |
Binding votes (4)
User | Vote | Timestamp |
---|---|---|
angellk | In favor | 2024-08-20 21:45:19.0 +00:00:00 |
kevin-wangzefeng | In favor | 2024-08-20 19:17:22.0 +00:00:00 |
TheFoxAtWork | In favor | 2024-08-20 15:24:34.0 +00:00:00 |
cathyhongzhang | In favor | 2024-08-20 15:24:10.0 +00:00:00 |
@dims | Pending | |
@rochaporto | Pending | |
@mauilion | Pending | |
@linsun | Pending | |
@dzolotusky | Pending | |
@nikhita | Pending | |
@kgamanji | Pending |
Non-binding votes (1)
User | Vote | Timestamp |
---|---|---|
wawa0210 | In favor | 2024-08-20 15:51:30.0 +00:00:00 |
Votes can only be checked once a day.
/check-vote
Vote status
So far 63.64% of the users with binding vote are in favor (passing threshold: 66%).
Summary
In favor | Against | Abstain | Not voted |
---|---|---|---|
7 | 0 | 0 | 4 |
Binding votes (7)
User | Vote | Timestamp |
---|---|---|
dzolotusky | In favor | 2024-08-21 13:39:57.0 +00:00:00 |
linsun | In favor | 2024-08-21 13:43:54.0 +00:00:00 |
angellk | In favor | 2024-08-20 21:45:19.0 +00:00:00 |
cathyhongzhang | In favor | 2024-08-20 15:24:10.0 +00:00:00 |
rochaporto | In favor | 2024-08-21 7:27:51.0 +00:00:00 |
TheFoxAtWork | In favor | 2024-08-20 15:24:34.0 +00:00:00 |
kevin-wangzefeng | In favor | 2024-08-20 19:17:22.0 +00:00:00 |
@dims | Pending | |
@mauilion | Pending | |
@nikhita | Pending | |
@kgamanji | Pending |
Non-binding votes (4)
User | Vote | Timestamp |
---|---|---|
raravena80 | In favor | 2024-08-20 23:35:09.0 +00:00:00 |
archlitchi | In favor | 2024-08-21 1:34:09.0 +00:00:00 |
zanetworker | In favor | 2024-08-21 11:07:37.0 +00:00:00 |
wawa0210 | In favor | 2024-08-21 15:16:48.0 +00:00:00 |
Vote closed
The vote passed! 🎉
72.73% of the users with binding vote were in favor (passing threshold: 66%).
Summary
In favor | Against | Abstain | Not voted |
---|---|---|---|
8 | 0 | 0 | 3 |
Binding votes (8)
User | Vote | Timestamp |
---|---|---|
@cathyhongzhang | In favor | 2024-08-20 15:24:10.0 +00:00:00 |
@kevin-wangzefeng | In favor | 2024-08-20 19:17:22.0 +00:00:00 |
@TheFoxAtWork | In favor | 2024-08-20 15:24:34.0 +00:00:00 |
@dzolotusky | In favor | 2024-08-21 13:39:57.0 +00:00:00 |
@linsun | In favor | 2024-08-21 13:43:54.0 +00:00:00 |
@nikhita | In favor | 2024-08-23 10:43:43.0 +00:00:00 |
@angellk | In favor | 2024-08-20 21:45:19.0 +00:00:00 |
@rochaporto | In favor | 2024-08-21 7:27:51.0 +00:00:00 |
Non-binding votes (4)
User | Vote | Timestamp |
---|---|---|
@raravena80 | In favor | 2024-08-20 23:35:09.0 +00:00:00 |
@archlitchi | In favor | 2024-08-21 1:34:09.0 +00:00:00 |
@zanetworker | In favor | 2024-08-21 11:07:37.0 +00:00:00 |
@wawa0210 | In favor | 2024-08-21 15:16:48.0 +00:00:00 |
Welcome and congrats on getting accepted as a CNCF Sandbox project!
You can get started on your on-boarding checklist here: cncf/toc#1413, and if you have any questions, please don't hesitate to reach out!
Thanks, we'll work on it.
With cncf/toc#1413 created we can go ahead and close this out :)
Congrats again!