cncf/sandbox

[Sandbox] HAMi

Closed this issue · 27 comments

Application contact emails

xiaozhang0210@hotmail.com,limengxuan@4paradigm.com

Project Summary

Heterogeneous AI Computing Virtualization Middleware (HAMi) is an "all-in-one" tool designed to manage heterogeneous AI computing devices in a k8s cluster.

Project Description

Heterogeneous AI Computing Virtualization Middleware (HAMi) is an "all-in-one" tool designed to manage Heterogeneous AI Computing Devices in a k8s cluster. It includes everything you would expect, such as:

  1. Heterogeneous AI computing device support: currently NVIDIA, Cambricon, Hygon, Huawei Ascend, and Iluvatar.
  2. Device sharing: Each task can allocate a portion of a device instead of the entire device, allowing a device to be shared among multiple tasks.
  3. Device Memory Control: Devices can be allocated a specific device memory size (e.g., 3000M) or a percentage of the whole GPU's memory (e.g., 50%), ensuring it does not exceed the specified boundaries.
  4. Device Type Specification: You can specify the type of device to use or avoid for a particular task by setting annotations, such as nvidia.com/use-gputype or nvidia.com/nouse-gputype.
  5. Device UUID Specification: You can specify the UUID of device to use or avoid for a particular task by setting annotations, such as nvidia.com/use-gpuuuid or nvidia.com/nouse-gpuuuid.
  6. Task priority: Tasks sharing the same AI computing device can define different priorities. When resources are preempted, high-priority tasks receive higher QoS.
  7. CUDA Unified Memory: When GPU memory is insufficient, node memory can be used as an extension.
  8. Easy to use: You don't need to modify your task YAML to use our scheduler. All your jobs will be automatically supported after installation. Additionally, you can specify a resource name other than nvidia.com/gpu if you prefer.
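As a rough illustration of items 2 to 4, the sketch below builds a Pod manifest that requests a slice of one GPU and pins the device type. The resource name `nvidia.com/gpu` and the annotation `nvidia.com/use-gputype` come from the list above; the memory resource name `nvidia.com/gpumem`, the helper name `build_shared_gpu_pod`, and the pod name, image, and GPU type values are assumptions made for this example, not confirmed by this application.

```python
import json


def build_shared_gpu_pod(name, image, gpu_mem_mib, gpu_type=None):
    """Return a Pod manifest (as a dict) that asks for part of one GPU.

    Hypothetical helper for illustration only.
    """
    annotations = {}
    if gpu_type is not None:
        # Annotation name taken from the feature list above.
        annotations["nvidia.com/use-gputype"] = gpu_type

    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "annotations": annotations},
        "spec": {
            "containers": [
                {
                    "name": name,
                    "image": image,
                    "resources": {
                        "limits": {
                            # Number of (shared) GPU devices requested.
                            "nvidia.com/gpu": 1,
                            # ASSUMPTION: resource name for the device
                            # memory cap, expressed in MiB (e.g. 3000).
                            "nvidia.com/gpumem": gpu_mem_mib,
                        }
                    },
                }
            ]
        },
    }


if __name__ == "__main__":
    pod = build_shared_gpu_pod("demo", "ubuntu:22.04", 3000, gpu_type="A100")
    # JSON is valid YAML, so this output can be piped to `kubectl apply -f -`.
    print(json.dumps(pod, indent=2))
```

The manifest is plain data, so the same limits and annotation can be written directly in a task YAML.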

The core features of HAMi are as follows:

  • Hard Limit on Device Memory.
  • Allows partial device allocation by specifying device memory.
  • Imposes a hard limit on streaming multiprocessors.
  • Flexible binpack and spread scheduling policies based on GPU device and node.
  • Permits partial device allocation by specifying device core usage.
  • Requires zero changes to existing programs.

The HAMi architecture is as follows

image

Application Scenarios

  1. Device sharing (or device virtualization) on Kubernetes.
  2. Scenarios where pods need to be allocated with specific device memory
  3. Need to balance GPU usage in a cluster with multiple GPU nodes.
  4. Low utilization of device memory and computing units, such as running 10 TensorFlow servings on one GPU.
  5. Situations that require a large number of small GPUs, such as teaching scenarios where one GPU is provided for multiple students to use, and cloud platforms that offer small GPU instances.

Org repo URL (provide if all repos under the org are in scope of the application)

https://github.com/Project-HAMi

Project repo URL in scope of application

Core repo: https://github.com/Project-HAMi/HAMi

All corresponding public repos: https://github.com/Project-HAMi/

Additional repos in scope of the application

No response

Website URL

http://project-hami.io/

Roadmap

https://github.com/Project-HAMi/HAMi?tab=readme-ov-file#roadmap

Roadmap context

Product   Manufacturer   MemoryIsolation   CoreIsolation   MultiCard support
GPU       NVIDIA
MLU       Cambricon
DCU       Hygon
Ascend    Huawei         In progress       In progress
GPU       Iluvatar       In progress       In progress
DPU       Teco           In progress       In progress

  • Support video codec processing
  • Support Multi-Instance GPUs (MIG)
  • Support Flexible scheduling policies
    • binpack
    • spread
    • numa affinity
  • Integrate gpu-operator
  • Rich observability support
  • DRA Support
  • Support Intel GPU device
  • Support AMD GPU device

Contributing Guide

https://github.com/Project-HAMi/HAMi/blob/master/CONTRIBUTING.md

Here are our community meeting minutes

https://docs.google.com/document/d/1YC6hco03_oXbF9IOUPJ29VWEddmITIKIfSmBX8JtGBw/edit?usp=sharing

Code of Conduct (CoC)

https://github.com/Project-HAMi/HAMi/blob/master/CODE_OF_CONDUCT.md

Adopters

We ran a survey and found that dozens of adopters are already using HAMi. We will maintain the list of adopters in the HAMi documentation later. Online survey results

Contributing or Sponsoring Org

4paradigm, DaoCloud, HuaweiCloud, Rise Union

Maintainers file

https://github.com/Project-HAMi/HAMi/blob/master/MAINTAINERS.md

IP Policy

  • If the project is accepted, I agree the project will follow the CNCF IP Policy

Trademark and accounts

  • If the project is accepted, I agree to donate all project trademarks and accounts to the CNCF

Why CNCF?

The CNCF is the premier organization for cloud-native technologies and is backed by many leading companies in the industry. It also provides a platform for collaboration and community-building, which can lead to increased visibility, adoption, and contributions to HAMi.

At the same time, HAMi can be combined with other outstanding CNCF projects (such as Volcano, KubeRay, and Kueue) to provide a one-stop service for AI infrastructure.

Benefit to the Landscape

As AI grows in popularity, AI accelerators are multiplying. NVIDIA is the best-known vendor, but many other device makers are also actively embracing K8s and the CNCF. It is therefore particularly important that these GPUs, NPUs, and other devices offer a consistent user experience on one platform, which is exactly what HAMi focuses on. HAMi greatly simplifies the management and operation of these GPUs and NPUs on K8s, and the application layer does not need to be aware of differences in the underlying hardware.

Cloud Native 'Fit'

HAMi is built with cloud native technology. It currently uses a scheduler plugin, a webhook, a device plugin, and related mechanisms to manage and schedule heterogeneous AI computing devices. In the future, it will consider using DRA to optimize the architecture.

Cloud Native 'Integration'

HAMi reuses part of the source code of the NVIDIA device-plugin project to support basic NVIDIA GPU features. On top of this, we support the following extensions for NVIDIA GPUs:

  1. Device sharing: Each task can allocate a portion of a device instead of the entire device, allowing a device to be shared among multiple tasks.
  2. Device Memory Control: Devices can be allocated a specific device memory size (e.g., 3000M) or a percentage of the whole GPU's memory (e.g., 50%), ensuring it does not exceed the specified boundaries.
  3. Device Type Specification: You can specify the type of device to use or avoid for a particular task by setting annotations, such as nvidia.com/use-gputype or nvidia.com/nouse-gputype.
  4. Device UUID Specification: You can specify the UUID of device to use or avoid for a particular task by setting annotations, such as nvidia.com/use-gpuuuid or nvidia.com/nouse-gpuuuid.
  5. Scheduling enhancements: HAMi builds on kube-scheduler and supports binpack and spread policies at both the node level and the GPU device level.
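To make the difference between the two policies concrete, here is a minimal sketch of device selection at the GPU level. It is a generic illustration of binpack versus spread, not HAMi's actual scoring logic; the `Gpu` class, the `pick_gpu` function, and the memory figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Gpu:
    name: str
    total_mem: int      # device memory in MiB
    used_mem: int = 0   # memory already allocated to other tasks

    @property
    def free_mem(self) -> int:
        return self.total_mem - self.used_mem


def pick_gpu(gpus, request_mem, policy):
    """Pick a GPU that can hold `request_mem` MiB, or None if none fits."""
    candidates = [g for g in gpus if g.free_mem >= request_mem]
    if not candidates:
        return None
    if policy == "binpack":
        # Fill the fullest device first so other devices stay empty.
        return min(candidates, key=lambda g: g.free_mem)
    if policy == "spread":
        # Use the emptiest device first to balance load.
        return max(candidates, key=lambda g: g.free_mem)
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    gpus = [Gpu("gpu0", 16000, used_mem=12000), Gpu("gpu1", 16000, used_mem=2000)]
    print(pick_gpu(gpus, 3000, "binpack").name)
    print(pick_gpu(gpus, 3000, "spread").name)
```

The same choice applies one level up when picking a node: binpack concentrates tasks on fewer nodes, while spread distributes them.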

Cloud Native Overlap

We do not think there is direct overlap with other CNCF projects at this time. However, we do touch on some areas that other projects are investigating, namely device plugins and scheduler enhancements.

Volcano also provides the ability to share GPUs. In v1.8, the volcano-vgpu feature was contributed to the Volcano repo by a HAMi maintainer. However, after discussions with the Volcano maintainers, and in order to support the independent development of the HAMi community, it was decided to release it in v1.9. Later, this functionality was transferred to the HAMi project and is now maintained by the HAMi community (repo: https://github.com/Project-HAMi/volcano-vgpu-device-plugin).

Similar projects

Comparison of HAMi with similar projects
image

Highlights

  • nvidia-device-plugin and k8s-dra-driver support only NVIDIA devices, not other heterogeneous AI computing devices.
  • nvidia-device-plugin and k8s-dra-driver focus on integrating GPUs with K8s, not on scheduling enhancements or rich observability metrics.

Comparison of GPU sharing solutions

image

Landscape

yes

image

HAMi is in the landscape and also in the CNAI group.

image https://landscape.cncf.io/?group=cnai

Business Product or Service to Project separation

N/A

Project presentations

No response

Project champions

No response

Additional information

No response

TAG-Runtime

dims commented
  • Project repo URL in scope of application lists just the main repo, are the other repos out of scope for donation?
  • is the k8s-dra-driver fork for convenience or is it really going to be a fork?

All public repos are in scope for donation.

k8s-dra-driver was forked for convenience; we plan to build our own DRA driver.

We've been exploring the combination of HAMi and DRA, and it is currently on the roadmap as well.

@raravena80 has TAG Runtime reviewed this project, and does it have a recommendation for the TOC?

They presented on May 16th, 2024.

Info:

TAG-Runtime is good with the project going to Sandbox provided they fulfill the CNCF Sandbox admission checklist.

cc: @srust @miao0miao @rajaskakodkar

HAMi project review: https://docs.google.com/document/d/1Lb4HYnJR21AEsNGurtXcXqEzdrKu95cG0NziufGEI0c/edit

FYI @angellk @raravena80 @srust @rajaskakodkar

Some feedback is still needed from the authors in the doc for completeness.

Thank you very much. I have made comments in the document and look forward to your reply.

@zanetworker

Thanks again for the very detailed and high-quality review of HAMi. I have replied to all the comments. If you have any questions, please leave a message.

I would like to clarify a few points.

1. Risk of single vendor contribution.

Because contributions were previously made in a non-standard way (direct commits, no PRs), the statistics are inaccurate. At present, DaoCloud and 4paradigm have contributed similar amounts.

These are the current contributor statistics: https://github.com/Project-HAMi/HAMi/graphs/contributors?from=2021-07-04&to=2024-08-09&type=c

The top eight contributors (sorted by commits) come from four different vendors: 4paradigm, DaoCloud, SAP, and NIVIC.

@archlitchi 4paradigm
@wawa0210 DaoCloud
@peizhaoyou 4paradigm
@lengrongfu DaoCloud
@chaunceyjiang DaoCloud
@CoderTH DaoCloud
@haitwang-cloud SAP
@whybeyoung NIVIC
@gsakun independent

Therefore, as I understand it, there is no risk of single-vendor contribution.

Of course, we will standardize the contribution process and look for more contributors in the future.

Thanks @wawa0210, I have incorporated your comments, and amended the context. Thank you for your collaboration and swift responses on the review.

TAG Contributor strategy has reviewed this project and found the following:

  • The contributor guide is very basic, particularly as it does not cover the current actual contributor process (as mentioned upthread).
  • HAMi does not have written governance, yet.
  • The roadmap is a brief checklist in the project README, mainly focused on future devices and device features to support. It appears to have been updated a few times over the last year.
  • There are three maintainers, who work for DaoCloud, HuaweiCloud, and 4Paradigm. As previously noted, contributor numbers may be misleading, but GitHub shows 24.
  • Community meetings are held in Chinese, and appear to go back more than a year. While agendas are public, we found no public notes or recordings.

This review is for the TOC’s information only. Sandbox projects are not required to have full governance or contributor documentation.

After discussion with HAMi maintainers, we added a governance document, https://github.com/Project-HAMi/HAMi?tab=readme-ov-file#governance

There are three maintainers, who work for DaoCloud, HuaweiCloud, and 4Paradigm. As previously noted, contributor numbers may be misleading, but GitHub shows 24.

HAMi has three maintainers and eleven community members.

Community meetings are held in Chinese, and appear to go back more than a year. While agendas are public, we found no public notes or recordings.

We currently hold a weekly community meeting in Chinese (this is our calendar), and there is also a developer WeChat group, which currently has 137 members. Public meeting minutes and screen recordings are indeed missing, and this needs to be improved. We also need to pay attention to internationalization.

Yeah, that's challenging. But, if your contributors speak Chinese, that makes sense for your meetings. And if you can get meeting notes up in Chinese, other folks can use Google Translate. For that reason, notes are better than recordings.

If you get accepted into the CNCF, you'll want to eventually cultivate a second, English-speaking community as well as your Chinese one.

Regarding cloud native overlap, to elaborate further: Volcano and HAMi each concentrate on distinct aspects, and the two projects collaborate closely. Taking GPU sharing as an example, Volcano offers policy-based scheduling of virtualized GPU resources, while HAMi provides isolation of GPU memory and cores on the node. The combination of the two projects has been adopted by a number of users and has received great feedback.

/vote

Vote created

@mrbobbytables has called for a vote on [Sandbox] HAMi (#97).

The members of the following teams have binding votes:

Team
@cncf/cncf-toc

Non-binding votes are also appreciated as a sign of support!

How to vote

You can cast your vote by reacting to this comment. The following reactions are supported:

In favor Against Abstain
👍 👎 👀

Please note that voting for multiple options is not allowed and those votes won't be counted.

The vote will be open for 2months 30days 2h 52m 48s. It will pass if at least 66% of the users with binding votes vote In favor 👍. Once it's closed, results will be published here as a new comment.

The TOC would also like the project to engage with the following Kubernetes groups in addition to completing the recommendations from the TAG:

  • SIG Node,
  • SIG Scheduling,
  • Batch WG,
  • Device Management WG

Thank you very much for the reminder. KubeCon Hong Kong happens to start on August 21st, and HAMi maintainers will attend. We will actively reach out to members of these groups, listen to their suggestions for HAMi's future, and enrich the roadmap.

/check-vote

Vote status

So far 36.36% of the users with binding vote are in favor (passing threshold: 66%).

Summary

In favor Against Abstain Not voted
4 0 0 7

Binding votes (4)

User Vote Timestamp
angellk In favor 2024-08-20 21:45:19.0 +00:00:00
kevin-wangzefeng In favor 2024-08-20 19:17:22.0 +00:00:00
TheFoxAtWork In favor 2024-08-20 15:24:34.0 +00:00:00
cathyhongzhang In favor 2024-08-20 15:24:10.0 +00:00:00
@dims Pending
@rochaporto Pending
@mauilion Pending
@linsun Pending
@dzolotusky Pending
@nikhita Pending
@kgamanji Pending

Non-binding votes (1)

User Vote Timestamp
wawa0210 In favor 2024-08-20 15:51:30.0 +00:00:00

Votes can only be checked once a day.

/check-vote

Vote status

So far 63.64% of the users with binding vote are in favor (passing threshold: 66%).

Summary

In favor Against Abstain Not voted
7 0 0 4

Binding votes (7)

User Vote Timestamp
dzolotusky In favor 2024-08-21 13:39:57.0 +00:00:00
linsun In favor 2024-08-21 13:43:54.0 +00:00:00
angellk In favor 2024-08-20 21:45:19.0 +00:00:00
cathyhongzhang In favor 2024-08-20 15:24:10.0 +00:00:00
rochaporto In favor 2024-08-21 7:27:51.0 +00:00:00
TheFoxAtWork In favor 2024-08-20 15:24:34.0 +00:00:00
kevin-wangzefeng In favor 2024-08-20 19:17:22.0 +00:00:00
@dims Pending
@mauilion Pending
@nikhita Pending
@kgamanji Pending

Non-binding votes (4)

User Vote Timestamp
raravena80 In favor 2024-08-20 23:35:09.0 +00:00:00
archlitchi In favor 2024-08-21 1:34:09.0 +00:00:00
zanetworker In favor 2024-08-21 11:07:37.0 +00:00:00
wawa0210 In favor 2024-08-21 15:16:48.0 +00:00:00

Vote closed

The vote passed! 🎉

72.73% of the users with binding vote were in favor (passing threshold: 66%).

Summary

In favor Against Abstain Not voted
8 0 0 3

Binding votes (8)

User Vote Timestamp
@cathyhongzhang In favor 2024-08-20 15:24:10.0 +00:00:00
@kevin-wangzefeng In favor 2024-08-20 19:17:22.0 +00:00:00
@TheFoxAtWork In favor 2024-08-20 15:24:34.0 +00:00:00
@dzolotusky In favor 2024-08-21 13:39:57.0 +00:00:00
@linsun In favor 2024-08-21 13:43:54.0 +00:00:00
@nikhita In favor 2024-08-23 10:43:43.0 +00:00:00
@angellk In favor 2024-08-20 21:45:19.0 +00:00:00
@rochaporto In favor 2024-08-21 7:27:51.0 +00:00:00

Non-binding votes (4)

User Vote Timestamp
@raravena80 In favor 2024-08-20 23:35:09.0 +00:00:00
@archlitchi In favor 2024-08-21 1:34:09.0 +00:00:00
@zanetworker In favor 2024-08-21 11:07:37.0 +00:00:00
@wawa0210 In favor 2024-08-21 15:16:48.0 +00:00:00

Welcome and congrats on getting accepted as a CNCF Sandbox project!

You can get started on your on-boarding checklist here: cncf/toc#1413

and if you have any questions, please don't hesitate to reach out!

Thanks, we'll work on it.

With cncf/toc#1413 created we can go ahead and close this out :)

Congrats again!