intel/intel-device-plugins-for-kubernetes

GPU product node label supports only one product type

Opened this issue · 17 comments

Describe the bug
gpu.intel.com/product supports only one product type, such as 'Flex_140' or 'Flex_170'; when both types of cards are installed, only 'Flex_140' is written as the label value (due to the rule order in https://github.com/intel/intel-device-plugins-for-kubernetes/blob/main/deployments/nfd/overlays/node-feature-rules/platform-labeling-rules.yaml). This matters when pods have differing placement preferences, e.g. a pod may perform better on a Flex 170 than on a Flex 140, so a scheduling affinity for nodes with a Flex 170 is preferred. Other logic may handle the actual consumption of that resource.

To Reproduce
Steps to reproduce the behavior:

  1. Node has a Flex 140 card and a Flex 170 card installed
  2. Device plugins are installed per the instructions
  3. Apply platform-labeling-rules.yaml
  4. Only gpu.intel.com/product=Flex_140 appears in the node labels

Expected behavior
The values for gpu.intel.com/product should follow the pattern of the other labels, such as gpu.intel.com/device-id.0380-56c0.present=true or gpu.intel.com/device-id.0380-56c1.present=true. At present, pods may select on those device-id labels, but they are not transparent with respect to Intel product naming. Suggest something like gpu.intel.com/product.Flex_140.present=true etc., which could indicate the presence of any commonly understood Intel GPU product name.
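For illustration, per-product presence labels could come from NFD rules along these lines (just a sketch, not a tested change to platform-labeling-rules.yaml; the rule and metadata names are made up, and 56c0/56c1 are the Flex 170/Flex 140 device IDs from the device-id labels above):

```yaml
# Sketch only: one presence label per product, so both can coexist on one node.
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: intel-gpu-product-presence   # hypothetical name
spec:
  rules:
    - name: "intel.gpu.flex140.present"
      labels:
        "gpu.intel.com/product.Flex_140.present": "true"
      matchFeatures:
        - feature: pci.device
          matchExpressions:
            vendor: {op: In, value: ["8086"]}
            device: {op: In, value: ["56c1"]}
    - name: "intel.gpu.flex170.present"
      labels:
        "gpu.intel.com/product.Flex_170.present": "true"
      matchFeatures:
        - feature: pci.device
          matchExpressions:
            vendor: {op: In, value: ["8086"]}
            device: {op: In, value: ["56c0"]}
```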

System (please complete the following information):

  • OpenShift 4.13.11
  • Device plugins version: v0.28.0
  • Hardware info: SPR with Flex 140 + Flex 170 in same system

Additional context
n/a

Hi @brgavino

The GPU plugin is supposed to run on homogeneous nodes (only one type of GPU). As both cards appear as the same resource (gpu.intel.com/i915), you wouldn't be able to assign Pods to different GPUs on the same node regardless of the node's labels. Given these constraints, the gpu.intel.com/product label works as specified.

Does your use case demand having different types of GPUs running on the same node?

SGShen commented

It is common to have two types of GPUs mixed in use when a node or a cluster is supposed to support different workloads. GAS should have a way to differentiate a 170 from a 140, since one may perform better or be more cost-efficient than the other for a given workload. While the formal solution may take a long time, can we have a quick workaround?

If you are referring to GPU Aware Scheduling (GAS), its README states that it expects the cluster to be homogeneous.
I'm sorry, but I don't think there are quick workarounds for this, unless you consider reconfiguring the cluster to have only one type of GPU per node.
For not-so-quick solutions, we could name the resource by product type. But that would leave existing Pod specs with the wrong resource requests.

One workaround we have considered is to present a stand-in, such as requesting a memory.max value larger than one card/tile type can provide. For the sharedDevNum=1 case, a Flex 170 has more available memory (16G vs. 12G total on a Flex 140), so there should be no possibility of being handed the other card type. In any case, heterogeneous GPU support has two areas to address; here, the node labelling should be extensible anyway - the choice to pick the last matching label in the NFD rules isn't suitable for edge deployments. I'll most likely create an issue in the GAS repo as well with this reference.
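For concreteness, such a stand-in request might look roughly like this (a sketch only; gpu.intel.com/memory.max is the extended resource used with GAS, and the pod name/image are placeholders):

```yaml
# Sketch: request more GPU memory than a Flex 140 card provides, hoping to rule it out.
apiVersion: v1
kind: Pod
metadata:
  name: prefer-flex170-by-memory   # placeholder name
spec:
  containers:
    - name: workload
      image: busybox               # placeholder image
      command: ["sleep", "3600"]
      resources:
        limits:
          gpu.intel.com/i915: 1
          gpu.intel.com/memory.max: "14G"   # above a Flex 140 card's 12G, within a Flex 170's 16G
```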

"For not-so-quick solutions, we could name the resource by product type."
Can we have multiple labels for one resource? For example, keep the resource name the same and add another label for the type.

"For not-so-quick solutions, we could name the resource by product type." Can we have multiple labels for one resource? For example, keep the resource name the same and add another label for the type.

I'm not sure I understand. A resource is what a pod requests, and what the devices are associated with. Labels, on the other hand, are attached only to nodes (not devices) and are used by the k8s scheduler to limit the set of nodes eligible to run a given pod.

Or did you mean annotating the pod for GAS?

One workaround we have considered is to present a stand-in, such as requesting a memory.max value larger than one card/tile type can provide. For the sharedDevNum=1 case, a Flex 170 has more available memory (16G vs. 12G total on a Flex 140), so there should be no possibility of being handed the other card type.

Your thought process is valid, but that's not how GAS works. As it assumes the nodes to be homogeneous, the node's total "memory.max" is divided equally between the GPUs. For a node with a Flex 170 and a Flex 140 card, it would consider each GPU to have ~9.3GB of memory ((16+6+6)/3).

But. Since this problem seems to be a recurring thing, either multiple dGPUs or iGPU + dGPU, I have planned to implement an alternative resource naming method. There, the user would be able to provide a "pci device id" to "resource name" mapping. Any GPUs on the mapping would get renamed, and the ones not on the mapping would be registered under the default name. It will have downsides: GAS would not support it and Pod specs would need adaptation. Maybe something else as well.
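Purely to illustrate the shape such a mapping could take (hypothetical; this is not an existing plugin configuration format, and the keys and file layout are invented):

```yaml
# Hypothetical sketch of a "PCI device ID" -> "resource name" mapping,
# e.g. delivered to the GPU plugin via a ConfigMap.
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-plugin-resource-names   # made-up name
data:
  mapping.yaml: |
    "56c0": "i915-flex170"   # Flex 170 -> gpu.intel.com/i915-flex170
    "56c1": "i915-flex140"   # Flex 140 -> gpu.intel.com/i915-flex140
    # device IDs not listed keep the default gpu.intel.com/i915 name
```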

Opinions are welcome.

Sounds good to me.

While GAS could use the same PCI ID <-> resource name mapping (if it's specified e.g. in a shared/secured configMap) to know about the devices, adding support for per-node devices with different amounts of resources could be a very large effort. Probably better to handle that with the DRA driver. @uniemimu ?

Correct me if I got this wrong - the problem seems to be a limitation in how NFD names Flex GPUs: it cannot tell two (or more) types of GPU apart within one server node, so GAS cannot schedule a pod onto a specific type of GPU. Adding a resource name mapping seems a little bit of an overkill in this case -- how many resource names would be needed?

Correct me if I got this wrong - the problem seems to be a limitation in how NFD names Flex GPUs: it cannot tell two (or more) types of GPU apart within one server node, so GAS cannot schedule a pod onto a specific type of GPU. Adding a resource name mapping seems a little bit of an overkill in this case -- how many resource names would be needed?

It's not about the labels. Let's say we had the labels correct in the sense you'd like them to be. A node has two GPUs, a Flex 140 and a Flex 170:

gpu.intel.com/product.Flex_140=1
gpu.intel.com/product.Flex_170=1

The pod spec would declare nodeAffinity to gpu.intel.com/product.Flex_170. During scheduling, the scheduler selects a node with the correct label and the GPU plugin is called to allocate a GPU. The GPU plugin selects either of the possible GPUs, because the i915 resource maps to any of them. The container gets either a Flex 140 or a Flex 170 GPU. Neither the label on the node nor the nodeAffinity in the Pod spec changes how the GPU plugin selects the GPU during allocate.
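For illustration, the Pod spec in that scenario might look like this (a sketch; the product.Flex_170 label is the hypothetical one above, and the pod name/image are placeholders):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: wants-flex170   # placeholder name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: gpu.intel.com/product.Flex_170
                operator: Exists
  containers:
    - name: workload
      image: busybox    # placeholder image
      command: ["sleep", "3600"]
      resources:
        limits:
          gpu.intel.com/i915: 1   # same resource name covers both GPU types,
                                  # so either GPU may be handed to the container
```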

GAS doesn't help either, because it only pre-selects the GPU to use by analyzing the extended resources (millicores, memory.max). GAS doesn't know whether card0 is a Flex 140 or a Flex 170. Though, it would be possible to add that kind of support by extending the node labels etc.

Sounds good to me.

While GAS could use the same PCI ID <-> resource name mapping (if it's specified e.g. in a shared/secured configMap) to know about the devices, adding support for per-node devices with different amounts of resources could be a very large effort. Probably better to handle that with the DRA driver. @uniemimu ?

It does look like the DRA driver would be a better fit in the long term. Based on the discussion here, it seems that DRA is the preferred path forward and the device plugins for GPU are legacy, so such features may not be added / may not be worth implementing?

GAS doesn't help either, because it only pre-selects the GPU to use by analyzing the extended resources (millicores, memory.max). GAS doesn't know whether card0 is a Flex 140 or a Flex 170. Though, it would be possible to add that kind of support by extending the node labels etc.

Right; dividing memory.max up equally can't steer workloads on heterogeneous hardware configs (the current implementation), and there's no way to ask for affinity to specific devices at the pod level - or to mark cards to avoid per node via node selectors, taints, etc. So even if the labels were extended, GAS would need the label-parsing logic to select the right resources based on the pod labels, for example.

The high-level story here, though, is that "As a GPU workload, I need to pick the type of GPU I land on". Other questions about specific resource requests seem to be handled by DRA.

It does look like the DRA driver would be a better fit in the long term. Based on the discussion here, it seems that DRA is the preferred path forward and the device plugins for GPU are legacy, so such features may not be added / may not be worth implementing?

I believe DRA is the way forward, but it is still alpha in K8s 1.29 and requires a feature flag to work. So device plugins will still be around for quite some time.

As your request is not the only one, I am leaning towards adding the necessary logic to allow users to rename the GPU resources based on the GPU type. It might make our next release (~April), depending on how much time I can steal from my other tasks.

Hi @tkatila, thanks for your explanations!

Now I get the picture. To summarize, a pod-level affinity preference (gpu_type) becomes useless after the pod is scheduled to a server node - because neither the GPU plugin nor GAS can differentiate the gpu_type of each GPU device in the server.

Regarding your solution -- a "pci device id" to "resource name" mapping, do you mean something like pci_device_id to "gpu_type" mapping?

Regarding your solution -- a "pci device id" to "resource name" mapping, do you mean something like pci_device_id to "gpu_type" mapping?

Yes. The plugin would read the device's PCI ID and then, based on it, apply the default or a custom resource name. For example, one could rename an integrated TGL GPU as "gpu.intel.com/i915-tgl" or a Flex 170 (56c0) as "gpu.intel.com/i915-flex170". The editable part would be the "i915" postfix; we need to keep the namespace as it is.

I think it would also make sense to provide example mappings for some of the cases.
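For illustration, a Pod spec would then request the renamed resource directly (a sketch using the example name above; the pod name/image are placeholders, and existing specs using gpu.intel.com/i915 would need updating):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: flex170-workload   # placeholder name
spec:
  containers:
    - name: workload
      image: busybox       # placeholder image
      command: ["sleep", "3600"]
      resources:
        limits:
          gpu.intel.com/i915-flex170: 1   # only GPUs mapped to this name can satisfy the request
```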

Yes. The proposal makes sense. Thanks!

Another couple of items that we may want to look at (if these should be tracked as feature requests elsewhere, please let me know):

  • memory.max is split equally, not by actual card capability -> this can cause issues even when we get the right device type; cards can be overscheduled
  • device-id.**.count incorrectly lists the type count when different cards are deployed -> e.g. gpu.intel.com/device-id.0380-56c1.count: '3' when there is 1 Flex 140 card and 1 Flex 170 card installed (56c0 is missing and its count is added to 56c1)
  • memory.max is split equally, not by actual card capability -> this can cause issues even when we get the right device type; cards can be overscheduled

AFAIK that's not going to change. These extended resources are intended to be used with GAS, which is supposed to work only with homogeneous clusters, and that will most likely stay like that.

  • device-id.**.count incorrectly lists the type count when different cards are deployed -> e.g. gpu.intel.com/device-id.0380-56c1.count: '3' when there is 1 Flex 140 card and 1 Flex 170 card installed (56c0 is missing and its count is added to 56c1)

Yep. That's because the NFD rule counts devices based on their pci class (0380), and the name of the label is taken from the "first" pci device on the list. The good thing is that it's dynamic, so it works most of the time and for all devices. If we wanted counts per GPU type, we'd have to add per-GPU (device id) rules. That's probably OK for 0380-class devices as there is a limited number of them; for 0300 the list would be too long.
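As a sketch of what a per-device-id count rule could look like (untested; it assumes NFD's NodeFeatureRule labelsTemplate and pci.device feature data, and the rule/metadata names are made up):

```yaml
apiVersion: nfd.k8s-sigs.io/v1alpha1
kind: NodeFeatureRule
metadata:
  name: intel-gpu-flex170-count   # made-up name
spec:
  rules:
    - name: "intel.gpu.flex170.count"
      # label value = number of matched PCI devices (56c0 = Flex 170)
      labelsTemplate: |
        gpu.intel.com/device-id.0380-56c0.count={{ .pci.device | len }}
      matchFeatures:
        - feature: pci.device
          matchExpressions:
            vendor: {op: In, value: ["8086"]}
            class: {op: In, value: ["0380"]}
            device: {op: In, value: ["56c0"]}
```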