[Sandbox] HAMi

Question


Application contact emails

xiaozhang0210@hotmail.com, limengxuan@4paradigm.com

Project Summary

Heterogeneous AI Computing Virtualization Middleware (HAMi) is an "all-in-one" tool designed to manage heterogeneous AI computing devices in a k8s cluster.

Project Description

Heterogeneous AI Computing Virtualization Middleware (HAMi) is an "all-in-one" tool designed to manage Heterogeneous AI Computing Devices in a k8s cluster. It includes everything you would expect, such as:

Heterogeneous AI computing device support: Currently supports NVIDIA, Cambricon, Hygon, Huawei Ascend, and Iluvatar devices.
Device sharing: Each task can be allocated a portion of a device instead of the entire device, allowing a device to be shared among multiple tasks.
Device Memory Control: A task can be allocated a specific device memory size (e.g., 3000M) or a percentage of the whole GPU's memory (e.g., 50%), and its usage cannot exceed the specified boundary.
Device Type Specification: You can specify the type of device to use or avoid for a particular task by setting annotations, such as nvidia.com/use-gputype or nvidia.com/nouse-gputype.
Device UUID Specification: You can specify the UUID of the device to use or avoid for a particular task by setting annotations, such as nvidia.com/use-gpuuuid or nvidia.com/nouse-gpuuuid.
Task priority: Tasks that share the same AI computing device can be given different priorities. When resources are contended, high-priority tasks receive a higher QoS.
CUDA Unified memory: When GPU memory is insufficient, node memory can be used as an extension.
Easy to use: You don't need to modify your task YAML to use our scheduler. All your jobs will be automatically supported after installation. Additionally, you can specify a resource name other than nvidia.com/gpu if you prefer.
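
To make the Device Memory Control item concrete, here is a minimal Python sketch of how a request given either as an absolute size (e.g., 3000M) or as a percentage of the whole GPU's memory (e.g., 50%) could be resolved to a single hard limit. It is an illustration only, not HAMi's implementation: the function name `resolve_memory_limit_mib`, the reading of "M" as MiB, and the 16384 MiB card size are all assumptions.

```python
def resolve_memory_limit_mib(request: str, device_total_mib: int) -> int:
    """Resolve a device memory request to a limit in MiB.

    Hypothetical helper: accepts "<n>M" (assumed to mean MiB) or "<p>%".
    """
    request = request.strip()
    if request.endswith("%"):
        percent = float(request[:-1])
        if not 0 < percent <= 100:
            raise ValueError(f"percentage out of range: {request!r}")
        limit = int(device_total_mib * percent / 100)
    elif request.upper().endswith("M"):
        limit = int(request[:-1])
    else:
        raise ValueError(f"unsupported memory request: {request!r}")

    # A hard limit only makes sense if it fits on the physical device.
    if limit <= 0 or limit > device_total_mib:
        raise ValueError(f"request {request!r} does not fit on the device")
    return limit


# Example with an assumed 16384 MiB device.
absolute_limit = resolve_memory_limit_mib("3000M", 16384)
percent_limit = resolve_memory_limit_mib("50%", 16384)
```

Resolving both forms to one unit up front keeps the later enforcement step simple: it only ever compares a task's usage against a single number.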

The core features of HAMi are as follows:

Hard limit on device memory.
Allows partial device allocation by specifying device memory.
Imposes a hard limit on streaming multiprocessors.
Flexible binpack and spread scheduling policies based on GPU device and node.
Permits partial device allocation by specifying device core usage.
Requires zero changes to existing programs.
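
The difference between the binpack and spread policies mentioned above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not HAMi's scheduler code: the `Candidate` type, the `pick_candidate` function, and the use of memory utilization as the only score are hypothetical. A candidate can stand for either a node or a single GPU, since the list above says the policies apply at both levels.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """A node or a GPU device, described only by its memory (hypothetical)."""
    name: str
    total_mib: int
    used_mib: int


def pick_candidate(candidates: List[Candidate], request_mib: int,
                   policy: str) -> Optional[Candidate]:
    """Pick a candidate that fits the request according to the policy."""
    fitting = [c for c in candidates
               if c.total_mib - c.used_mib >= request_mib]
    if not fitting:
        return None

    def utilization(c: Candidate) -> float:
        return c.used_mib / c.total_mib

    if policy == "binpack":
        # Fill the busiest candidate first, leaving others free.
        return max(fitting, key=utilization)
    if policy == "spread":
        # Use the least busy candidate to balance load.
        return min(fitting, key=utilization)
    raise ValueError(f"unknown policy: {policy!r}")


gpus = [
    Candidate("gpu-a", total_mib=16000, used_mib=12000),
    Candidate("gpu-b", total_mib=16000, used_mib=2000),
    Candidate("gpu-c", total_mib=16000, used_mib=15000),
]
binpack_choice = pick_candidate(gpus, 3000, "binpack")
spread_choice = pick_candidate(gpus, 3000, "spread")
```

With these example numbers, gpu-c is excluded because it has only 1000 MiB free; binpack then selects gpu-a and spread selects gpu-b.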

The HAMi architecture is as follows:

[Figure: HAMi architecture]

Application Scenarios

Device sharing (or device virtualization) on Kubernetes.
Pods that need to be allocated a specific amount of device memory.
Clusters with multiple GPU nodes where GPU usage needs to be balanced.
Low utilization of device memory and computing units, such as running 10 TensorFlow Serving instances on one GPU.
Situations that require a large number of small GPUs, such as teaching scenarios where one GPU is shared by multiple students, and cloud platforms that offer small GPU instances.

Org repo URL (provide if all repos under the org are in scope of the application)

https://github.com/Project-HAMi

Project repo URL in scope of application

Core repo: https://github.com/Project-HAMi/HAMi

The other public repos under the org: https://github.com/Project-HAMi/

Additional repos in scope of the application

No response

Website URL

http://project-hami.io/

Roadmap

https://github.com/Project-HAMi/HAMi?tab=readme-ov-file#roadmap

Roadmap context

Product	Manufacturer	Memory isolation	Core isolation	Multi-card support
GPU	NVIDIA	✅	✅	✅
MLU	Cambricon	✅	❌	❌
DCU	Hygon	✅	✅	❌
Ascend	Huawei	In progress	In progress	❌
GPU	iluvatar	In progress	In progress	❌
DPU	Teco	In progress	In progress	❌

Support video codec processing
Support Multi-Instance GPUs (MIG)
Support Flexible scheduling policies
- binpack
- spread
- NUMA affinity
Integrate with gpu-operator
Rich observability support
DRA Support
Support Intel GPU device
Support AMD GPU device

Contributing Guide

https://github.com/Project-HAMi/HAMi/blob/master/CONTRIBUTING.md

Here are our community meeting minutes

https://docs.google.com/document/d/1YC6hco03_oXbF9IOUPJ29VWEddmITIKIfSmBX8JtGBw/edit?usp=sharing

Code of Conduct (CoC)

https://github.com/Project-HAMi/HAMi/blob/master/CODE_OF_CONDUCT.md

Adopters

We ran a survey and found that dozens of adopters are already using HAMi. We will maintain the list of adopters in the HAMi documentation later. Online survey results

Contributing or Sponsoring Org

4paradigm, DaoCloud, HuaweiCloud, Rise Union

Maintainers file

https://github.com/Project-HAMi/HAMi/blob/master/MAINTAINERS.md

IP Policy

If the project is accepted, I agree the project will follow the CNCF IP Policy

Trademark and accounts

If the project is accepted, I agree to donate all project trademarks and accounts to the CNCF

Why CNCF?

The CNCF is the premier organization for cloud-native technologies and is backed by many leading companies in the industry. It also provides a platform for collaboration and community-building, which can lead to increased visibility, adoption, and contributions to HAMi.

At the same time, HAMi can be combined with other outstanding CNCF projects (such as Volcano, KubeRay, and Kueue) to provide a one-stop service for AI infrastructure.

Benefit to the Landscape

As AI becomes more and more popular, many different AI computing devices are emerging. NVIDIA is the best-known vendor, but many other device vendors are also actively embracing K8s and the CNCF. It is therefore particularly important that these numerous GPUs, NPUs, and other devices offer a consistent user experience on one platform, and this is exactly what HAMi focuses on. Using HAMi greatly simplifies the management and operation of these GPUs and NPUs on K8s, and the application layer does not need to be aware of differences in the underlying hardware.

Cloud Native 'Fit'

HAMi is built using cloud native technology. It currently uses a scheduler plugin, a webhook, a device plugin, and other technologies to manage and schedule heterogeneous AI computing devices. In the future, we will consider using DRA to optimize the architecture.

Cloud Native 'Integration'

HAMi reuses part of the source code of the NVIDIA device-plugin project to support basic NVIDIA GPU features. On top of this, we add the following extensions for NVIDIA GPUs:

Device sharing: Each task can be allocated a portion of a device instead of the entire device, allowing a device to be shared among multiple tasks.
Device Memory Control: A task can be allocated a specific device memory size (e.g., 3000M) or a percentage of the whole GPU's memory (e.g., 50%), and its usage cannot exceed the specified boundary.
Device Type Specification: You can specify the type of device to use or avoid for a particular task by setting annotations, such as nvidia.com/use-gputype or nvidia.com/nouse-gputype.
Device UUID Specification: You can specify the UUID of the device to use or avoid for a particular task by setting annotations, such as nvidia.com/use-gpuuuid or nvidia.com/nouse-gpuuuid.
Scheduling enhancements: HAMi provides scheduling enhancements based on kube-scheduler and supports binpack and spread policies at both the node and GPU device levels.
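
The use/avoid annotations above amount to an allow-list and a deny-list applied to each candidate device. The Python sketch below shows one way such a filter could behave. Only the two annotation keys come from the text above; the comma-separated value format, the case-insensitive substring match, and the function name `device_type_allowed` are assumptions made for illustration and may differ from HAMi's actual behavior.

```python
from typing import Dict, List

USE_TYPE = "nvidia.com/use-gputype"
NOUSE_TYPE = "nvidia.com/nouse-gputype"


def _split(value: str) -> List[str]:
    """Split an annotation value into lower-cased entries (assumed format)."""
    return [item.strip().lower() for item in value.split(",") if item.strip()]


def device_type_allowed(device_type: str, annotations: Dict[str, str]) -> bool:
    """Return True if a device of this type may serve the task."""
    device = device_type.lower()
    wanted = _split(annotations.get(USE_TYPE, ""))
    avoided = _split(annotations.get(NOUSE_TYPE, ""))

    # The deny-list always wins.
    if any(entry in device for entry in avoided):
        return False
    # An empty allow-list means "no restriction".
    if wanted and not any(entry in device for entry in wanted):
        return False
    return True


task_annotations = {USE_TYPE: "A100,V100"}
a100_ok = device_type_allowed("NVIDIA A100-SXM4-40GB", task_annotations)
t4_ok = device_type_allowed("Tesla T4", task_annotations)
```

The same shape would apply to the UUID annotations, with an exact match on the device UUID instead of a substring match on the type.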

Cloud Native Overlap

We do not think there is direct overlap with other CNCF projects at this time. However, we do touch on some of the areas that other projects are investigating in the space of device plugins and scheduler enhancements.

Volcano also provides the ability to share GPUs. In v1.8, the volcano-vgpu features were contributed to the Volcano repo by a HAMi maintainer. However, after discussions with the Volcano maintainers, and in order to support the independent development of the HAMi community, it was decided to release it in v1.9. Later, this functionality was transferred to the HAMi project and is now maintained by the HAMi community (repo: https://github.com/Project-HAMi/volcano-vgpu-device-plugin).

Similar projects

Some comparisons between HAMi and similar projects

Highlights

nvidia-device-plugin and k8s-dra-driver only support NVIDIA devices and do not support other heterogeneous AI computing devices.
nvidia-device-plugin and k8s-dra-driver focus on integrating GPUs with K8s, and do not focus on scheduling enhancements or rich observability metrics.

Comparison of GPU sharing solutions

Landscape

Yes

HAMi is in the landscape, and also in the CNAI group:

https://landscape.cncf.io/?group=cnai

Business Product or Service to Project separation

N/A

Project presentations

No response

Project champions

No response

Additional information

No response

mrbobbytables commented a month ago

/vote


Answer 1 · 2024-07-08T18:12:38.000Z

TAG-Runtime

Answer 2 · 2024-07-23T13:31:51.000Z

"Project repo URL in scope of application" lists just the main repo; are the other repos out of scope for donation?
Is the k8s-dra-driver fork for convenience, or is it really going to be a fork?

Answer 3 · 2024-07-25T07:04:27.000Z

> "Project repo URL in scope of application" lists just the main repo; are the other repos out of scope for donation?

> Is the k8s-dra-driver fork for convenience, or is it really going to be a fork?

All public repos are in scope for donation.

k8s-dra-driver was forked for convenience; we plan to build our own DRA driver.

Answer 4 · 2024-07-26T05:51:18.000Z

> "Project repo URL in scope of application" lists just the main repo; are the other repos out of scope for donation?

> Is the k8s-dra-driver fork for convenience, or is it really going to be a fork?

We've been exploring the combination of HAMi and DRA, and it is currently on the roadmap as well.

Answer 5 · 2024-07-30T19:37:13.000Z

@raravena80 has TAG Runtime reviewed this project, and does it have a recommendation for the TOC?

Answer 6 · 2024-07-30T20:36:43.000Z

They presented on May 16th, 2024.

Info:

Slides: https://docs.google.com/presentation/d/1T-fJ-hCIiAlPsnG4WpzEMLGIL3PTIeHE5Lzv-HOEroQ/edit#slide=id.g2cd23b2285d_0_2
Video of presentation: https://youtu.be/SwdEoQYkMsE

TAG-Runtime is good with the project going to Sandbox, provided they fulfill the CNCF Sandbox admission checklist.

cc: @srust @miao0miao @rajaskakodkar

Answer 7 · 2024-08-09T11:50:02.000Z

Hami project review: https://docs.google.com/document/d/1Lb4HYnJR21AEsNGurtXcXqEzdrKu95cG0NziufGEI0c/edit

FYI @angellk @raravena80 @srust @rajaskakodkar

Some feedback is still needed from the authors in the doc for completeness.

Answer 8 · 2024-08-09T13:18:20.000Z

> Hami project review: https://docs.google.com/document/d/1Lb4HYnJR21AEsNGurtXcXqEzdrKu95cG0NziufGEI0c/edit

> FYI @angellk @raravena80 @srust @rajaskakodkar

> Some feedback is still needed from the authors in the doc for completeness.

Thank you very much. I have made comments in the document and look forward to your reply.

Answer 9 · 2024-08-09T14:51:24.000Z

> Hami project review: https://docs.google.com/document/d/1Lb4HYnJR21AEsNGurtXcXqEzdrKu95cG0NziufGEI0c/edit

> FYI @angellk @raravena80 @srust @rajaskakodkar

> Some feedback is still needed from the authors in the doc for completeness.

@zanetworker

Thanks again for the very detailed and high-quality review of HAMi. I have replied to all the comments. If you have any questions, please leave a message.

I would like to clarify a few points.

1. Risk of single vendor contribution.

Because contributions were previously made in a non-standard way (direct commits, no PRs), the statistics are inaccurate. At present, DaoCloud and 4paradigm have made similar amounts of contributions.

These are the current contributor statistics: https://github.com/Project-HAMi/HAMi/graphs/contributors?from=2021-07-04&to=2024-08-09&type=c

The top eight contributors (sorted by commits) come from four different vendors: 4paradigm, DaoCloud, SAP, and NIVIC.

@archlitchi 4paradigm
@wawa0210 DaoCloud
@peizhaoyou 4paradigm
@lengrongfu DaoCloud
@chaunceyjiang DaoCloud
@CoderTH DaoCloud
@haitwang-cloud SAP
@whybeyoung NIVIC
@gsakun independent

Therefore, as I understand it, there is no risk of single-vendor contribution.

Of course, we will standardize the contribution process and look for more contributors in the future.

Answer 10 · 2024-08-09T16:15:13.000Z

Thanks @wawa0210, I have incorporated your comments, and amended the context. Thank you for your collaboration and swift responses on the review.

Answer 11 · 2024-08-09T22:56:19.000Z

TAG Contributor strategy has reviewed this project and found the following:

The contributor guide is very basic, particularly as it does not cover the current actual contributor process (as mentioned upthread).
HAMi does not have written governance, yet.
The roadmap is a brief checklist in the project README, mainly focused on future devices and device features to support. It appears to have been updated a few times over the last year.
There are three maintainers, who work for DaoCloud, HuaweiCloud, and 4Paradigm. As previously noted, contributor numbers may be misleading, but GitHub shows 24.
Community meetings are held in Chinese, and appear to go back more than a year. While agendas are public, we found no public notes or recordings.

This review is for the TOC’s information only. Sandbox projects are not required to have full governance or contributor documentation.

Answer 12 · 2024-08-12T02:52:08.000Z

> TAG Contributor strategy has reviewed this project and found the following:

> The contributor guide is very basic, particularly as it does not cover the current actual contributor process (as mentioned upthread).

> HAMi does not have written governance, yet.

> The roadmap is a brief checklist in the project README, mainly focused on future devices and device features to support. It appears to have been updated a few times over the last year.

> There are three maintainers, who work for DaoCloud, HuaweiCloud, and 4Paradigm. As previously noted, contributor numbers may be misleading, but GitHub shows 24.

> Community meetings are held in Chinese, and appear to go back more than a year. While agendas are public, we found no public notes or recordings.

> This review is for the TOC’s information only. Sandbox projects are not required to have full governance or contributor documentation.

After discussion with the HAMi maintainers, we added a governance document: https://github.com/Project-HAMi/HAMi?tab=readme-ov-file#governance

> There are three maintainers, who work for DaoCloud, HuaweiCloud, and 4Paradigm. As previously noted, contributor numbers may be misleading, but GitHub shows 24.

HAMi has three maintainers and eleven community members.

> Community meetings are held in Chinese, and appear to go back more than a year. While agendas are public, we found no public notes or recordings.

We currently hold a weekly community meeting in Chinese (this is our calendar), and there is also a developer WeChat group, which currently has 137 members. Public meeting minutes and recordings are indeed missing, and this needs to be improved. At the same time, we also need to pay attention to internationalization.

Answer 13 · 2024-08-12T21:07:22.000Z

> We currently hold a weekly community meeting in Chinese (this is our calendar), and there is also a developer WeChat group, which currently has 137 members. Public meeting minutes and recordings are indeed missing, and this needs to be improved. At the same time, we also need to pay attention to internationalization.

Yeah, that's challenging. But, if your contributors speak Chinese, that makes sense for your meetings. And if you can get meeting notes up in Chinese, other folks can use Google Translate. For that reason, notes are better than recordings.

If you get accepted into the CNCF, you'll want to eventually cultivate a second, English-speaking community as well as your Chinese one.

Answer 14 · 2024-08-13T12:30:03.000Z

Regarding cloud native overlap, to elaborate further: the two projects, Volcano and HAMi, each concentrate on distinct aspects, and they collaborate closely. Taking GPU sharing as an example, Volcano offers policy-based scheduling of virtualized GPU resources, while HAMi provides isolation of GPU memory and cores on the node. The combination of the two projects has been adopted by a number of users and has received great feedback.

Answer 15 · 2024-08-20T15:20:53.000Z

Vote created

@mrbobbytables has called for a vote on [Sandbox] HAMi (#97).

The members of the following teams have binding votes:

Team
@cncf/cncf-toc

Non-binding votes are also appreciated as a sign of support!

How to vote

You can cast your vote by reacting to this comment. The following reactions are supported:

In favor	Against	Abstain
👍	👎	👀

Please note that voting for multiple options is not allowed and those votes won't be counted.

The vote will be open for 2months 30days 2h 52m 48s. It will pass if at least 66% of the users with binding votes vote In favor 👍. Once it's closed, results will be published here as a new comment.

Answer 16 · 2024-08-20T15:23:36.000Z

The TOC would also like the project to engage with the following Kubernetes groups in addition to completing the recommendations from the TAG:

SIG Node,
SIG Scheduling,
Batch WG,
Device Management WG

Answer 17 · 2024-08-20T15:53:53.000Z

> The TOC would also like the project to engage with the following Kubernetes groups in addition to completing the recommendations from the TAG:

> SIG Node,

> SIG Scheduling,

> Batch WG,

> Device Management WG

Thank you very much for the reminder. KubeCon in Hong Kong happens to start on August 21st, and the HAMi maintainers will attend. We will actively reach out to members of these SIGs and working groups, listen to their suggestions for HAMi's future, and enrich the roadmap accordingly.

Answer 18 · 2024-08-20T22:58:10.000Z

/check-vote

Answer 19 · 2024-08-20T22:58:12.000Z

Vote status

So far 36.36% of the users with binding vote are in favor (passing threshold: 66%).

Summary

In favor	Against	Abstain	Not voted
4	0	0	7

Binding votes (4)

User	Vote	Timestamp
angellk	In favor	2024-08-20 21:45:19.0 +00:00:00
kevin-wangzefeng	In favor	2024-08-20 19:17:22.0 +00:00:00
TheFoxAtWork	In favor	2024-08-20 15:24:34.0 +00:00:00
cathyhongzhang	In favor	2024-08-20 15:24:10.0 +00:00:00
@dims	Pending
@rochaporto	Pending
@mauilion	Pending
@linsun	Pending
@dzolotusky	Pending
@nikhita	Pending
@kgamanji	Pending

Non-binding votes (1)

User	Vote	Timestamp
wawa0210	In favor	2024-08-20 15:51:30.0 +00:00:00

Answer 20 · 2024-08-21T13:01:04.000Z

Votes can only be checked once a day.

Answer 21 · 2024-08-21T23:17:09.000Z

/check-vote

Answer 22 · 2024-08-21T23:17:11.000Z

Vote status

So far 63.64% of the users with binding vote are in favor (passing threshold: 66%).

Summary

In favor	Against	Abstain	Not voted
7	0	0	4

Binding votes (7)

User	Vote	Timestamp
dzolotusky	In favor	2024-08-21 13:39:57.0 +00:00:00
linsun	In favor	2024-08-21 13:43:54.0 +00:00:00
angellk	In favor	2024-08-20 21:45:19.0 +00:00:00
cathyhongzhang	In favor	2024-08-20 15:24:10.0 +00:00:00
rochaporto	In favor	2024-08-21 7:27:51.0 +00:00:00
TheFoxAtWork	In favor	2024-08-20 15:24:34.0 +00:00:00
kevin-wangzefeng	In favor	2024-08-20 19:17:22.0 +00:00:00
@dims	Pending
@mauilion	Pending
@nikhita	Pending
@kgamanji	Pending

Non-binding votes (4)

User	Vote	Timestamp
raravena80	In favor	2024-08-20 23:35:09.0 +00:00:00
archlitchi	In favor	2024-08-21 1:34:09.0 +00:00:00
zanetworker	In favor	2024-08-21 11:07:37.0 +00:00:00
wawa0210	In favor	2024-08-21 15:16:48.0 +00:00:00

Answer 23 · 2024-08-23T10:54:04.000Z

Vote closed

The vote passed! 🎉

72.73% of the users with binding vote were in favor (passing threshold: 66%).

Summary

In favor	Against	Abstain	Not voted
8	0	0	3
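
The percentages reported in the vote status comments follow directly from the summary tables: each table's four columns sum to 11 binding voters, and the passing threshold is 66%. The short Python check below reproduces the three reported figures; the helper name `binding_vote_status` is hypothetical and is not part of the voting bot.

```python
from typing import Tuple


def binding_vote_status(in_favor: int, total_binding: int,
                        threshold_percent: float = 66.0) -> Tuple[float, bool]:
    """Return (percent in favor rounded to 2 places, whether the vote passes)."""
    percent = 100 * in_favor / total_binding
    return round(percent, 2), percent >= threshold_percent


# In-favor counts from the three status comments, out of 11 binding voters.
first_check = binding_vote_status(4, 11)
second_check = binding_vote_status(7, 11)
final_result = binding_vote_status(8, 11)
```

Seven votes in favor (63.64%) fall just short of the threshold, so the eighth vote is the one that carries the result to 72.73%.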

Binding votes (8)

User	Vote	Timestamp
@cathyhongzhang	In favor	2024-08-20 15:24:10.0 +00:00:00
@kevin-wangzefeng	In favor	2024-08-20 19:17:22.0 +00:00:00
@TheFoxAtWork	In favor	2024-08-20 15:24:34.0 +00:00:00
@dzolotusky	In favor	2024-08-21 13:39:57.0 +00:00:00
@linsun	In favor	2024-08-21 13:43:54.0 +00:00:00
@nikhita	In favor	2024-08-23 10:43:43.0 +00:00:00
@angellk	In favor	2024-08-20 21:45:19.0 +00:00:00
@rochaporto	In favor	2024-08-21 7:27:51.0 +00:00:00

Non-binding votes (4)

User	Vote	Timestamp
@raravena80	In favor	2024-08-20 23:35:09.0 +00:00:00
@archlitchi	In favor	2024-08-21 1:34:09.0 +00:00:00
@zanetworker	In favor	2024-08-21 11:07:37.0 +00:00:00
@wawa0210	In favor	2024-08-21 15:16:48.0 +00:00:00

Answer 24 · 2024-08-29T18:21:53.000Z

Welcome and congrats on getting accepted as a CNCF Sandbox project!

You can get started on your on-boarding checklist here: cncf/toc#1413

and if you have any questions, please don't hesitate to reach out!

Answer 25 · 2024-08-30T07:08:13.000Z

> cncf/toc#1413

Thanks, we'll work on it.

Answer 26 · 2024-09-03T20:41:20.000Z

With cncf/toc#1413 created we can go ahead and close this out :)

Congrats again!