Support heterogeneous (different types of) Intel GPU cards in the same OCP cluster

Question

Opened this issue 6 months ago · 2 comments

Summary

Support the heterogeneous (different) Intel GPU cards in the same OCP cluster.

Detail

Consider a scenario in which different Intel GPU cards, such as the Max-1100, Flex-140, and Flex-170, are provisioned in the same cluster. Users need a mechanism to select the GPU card they want their workloads to run on.
To align with the Red Hat OpenShift AI accelerator profile, which is based on taints and tolerations, this feature will use the same taints/tolerations mechanism.
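
As a rough illustration of how taints and tolerations could steer a workload to one GPU type, the sketch below mirrors the Kubernetes toleration-matching rules in plain Python. The taint keys (`gpu.intel.com/flex-170`, `gpu.intel.com/max-1100`) and the node data are hypothetical names chosen for this example; the issue does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str  # "NoSchedule", "PreferNoSchedule" or "NoExecute"
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""            # empty key with operator "Exists" matches any key
    operator: str = "Equal"  # "Equal" or "Exists"
    value: str = ""
    effect: str = ""         # empty effect matches every effect


def tolerates(tol: Toleration, taint: Taint) -> bool:
    """Return True if the toleration matches the taint (Kubernetes semantics)."""
    if tol.effect and tol.effect != taint.effect:
        return False
    if tol.key and tol.key != taint.key:
        return False
    if tol.operator == "Exists":
        return True
    if tol.operator == "Equal":
        return tol.value == taint.value
    return False


def can_schedule(node_taints, pod_tolerations) -> bool:
    """Hard taints (NoSchedule/NoExecute) must all be tolerated.

    PreferNoSchedule is only a soft preference, so it never blocks scheduling.
    """
    hard = [t for t in node_taints if t.effect in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, t) for tol in pod_tolerations) for t in hard)


# Hypothetical cluster: one tainted node per GPU type.
NODES = {
    "node-flex170": [Taint("gpu.intel.com/flex-170", "NoSchedule", "true")],
    "node-max1100": [Taint("gpu.intel.com/max-1100", "NoSchedule", "true")],
}

# A workload that should land only on Flex-170 nodes.
POD_TOLERATIONS = [Toleration(key="gpu.intel.com/flex-170", operator="Exists",
                              effect="NoSchedule")]

eligible = [name for name, taints in NODES.items()
            if can_schedule(taints, POD_TOLERATIONS)]
```

Note that a toleration only allows a pod onto a tainted node; it does not keep the pod off untainted nodes, so a real workload would still request the GPU resource or use a node selector.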

To label (taint) the nodes in the cluster automatically, we will rely on the NFD node tainting feature.

This feature therefore depends on issue openshift/cluster-nfd-operator#356.
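
For illustration, an NFD `NodeFeatureRule` that taints nodes by GPU model might look like the manifest generated below. This is a sketch under several assumptions, none of which come from the issue itself: the taint keys and rule names are hypothetical, the PCI device IDs should be verified against the actual hardware (for example with `lspci -nn`), and the `apiVersion` is the upstream NFD one, which may differ in the OpenShift NFD operator. Upstream NFD also treats tainting as an experimental feature that has to be enabled explicitly on nfd-master.

```python
import json

INTEL_VENDOR_ID = "8086"

# Assumed PCI device IDs per GPU model; verify before use.
GPU_MODELS = {
    "flex-140": ["56c1"],
    "flex-170": ["56c0"],
    "max-1100": ["0bda"],
}


def build_rule(model, device_ids, effect="NoSchedule"):
    """Build one rule that taints nodes exposing the given PCI devices."""
    return {
        "name": f"intel-gpu-{model}",  # hypothetical rule name
        "taints": [{
            "key": f"gpu.intel.com/{model}",  # hypothetical taint key
            "value": "true",
            "effect": effect,
        }],
        "matchFeatures": [{
            "feature": "pci.device",
            "matchExpressions": {
                "vendor": {"op": "In", "value": [INTEL_VENDOR_ID]},
                "device": {"op": "In", "value": list(device_ids)},
            },
        }],
    }


def build_node_feature_rule(models):
    """Assemble a NodeFeatureRule with one taint rule per GPU model."""
    return {
        "apiVersion": "nfd.k8s-sigs.io/v1alpha1",  # upstream NFD; an assumption here
        "kind": "NodeFeatureRule",
        "metadata": {"name": "intel-gpu-taints"},  # hypothetical object name
        "spec": {
            "rules": [build_rule(m, ids) for m, ids in sorted(models.items())],
        },
    }


manifest = build_node_feature_rule(GPU_MODELS)

if __name__ == "__main__":
    # JSON is valid YAML, so the output can be saved and applied as a manifest.
    print(json.dumps(manifest, indent=2))
```

Generating one rule per model keeps each taint independent, so a node carrying a single card type receives exactly one taint, consistent with the note below that mixed cards in one node are not supported.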

Note

The feature is for the heterogeneous (different) Intel GPU cards in the same OCP cluster.
The different Intel dGPU cards in the same node are not supported.

Answer 1 · 2024-03-05T05:30:30.000Z

/cc @tkatila

Answer 2 · 2024-03-08T15:11:23.000Z

The different Intel dGPU cards in the same node are not supported.

This limitation exists only because of GAS support and does not depend on NFD labelling or taints/tolerations, correct?

How does this align with future resource requests via DRA? It seems divergent at first glance.