NVIDIA/nvidia-container-runtime

Support reading NVIDIA_VISIBLE_DEVICES from a file and updating it in the container's OCI Spec

happy2048 opened this issue · 2 comments

Hi, container toolkit team:

1. Requirement

The nvidia-container-runtime is used to add the following part to the OCI Spec file of the container:

    "hooks": {
        "prestart": [
            {
                "path": "/usr/bin/nvidia-container-runtime-hook",
                "args": [
                    "/usr/bin/nvidia-container-runtime-hook",
                    "prestart"
                ]
            }
        ]
    },
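For reference, a rough sketch of how a runtime shim could inject this hook (this is only an illustration, not the actual nvidia-container-runtime source; the package and function names are placeholders, and the spec types come from the OCI runtime-spec specs-go module):

    package ocihook

    import (
        "encoding/json"
        "os"
        "path/filepath"

        specs "github.com/opencontainers/runtime-spec/specs-go"
    )

    // addPrestartHook loads <bundle>/config.json, appends the
    // nvidia-container-runtime-hook prestart entry shown above, and writes
    // the spec back for the real runtime (runc) to consume.
    func addPrestartHook(bundleDir string) error {
        configPath := filepath.Join(bundleDir, "config.json")

        data, err := os.ReadFile(configPath)
        if err != nil {
            return err
        }
        var spec specs.Spec
        if err := json.Unmarshal(data, &spec); err != nil {
            return err
        }

        if spec.Hooks == nil {
            spec.Hooks = &specs.Hooks{}
        }
        spec.Hooks.Prestart = append(spec.Hooks.Prestart, specs.Hook{
            Path: "/usr/bin/nvidia-container-runtime-hook",
            Args: []string{"/usr/bin/nvidia-container-runtime-hook", "prestart"},
        })

        out, err := json.Marshal(&spec)
        if err != nil {
            return err
        }
        return os.WriteFile(configPath, out, 0o644)
    }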

Can it also implement the following feature?

  • I have a file (eg: /run/nvidia_env.txt) to which some environment variables are written, eg:

    $ cat /run/nvidia_env.txt
    NVIDIA_VISIBLE_DEVICES=0,1
    NVIDIA_DRIVER_CAPABILITIES=compute
    USER_CUSTOM_ENV=hello

  • If the container has been given a specific environment variable (eg: NVIDIA_ENV_FROM_FILE=/run/nvidia_env.txt), could the nvidia-container-runtime read the file /run/nvidia_env.txt and add the variables it contains to the OCI Spec of the container? (A sketch of this behaviour follows the list.)
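For illustration, a rough sketch of what I mean (not existing nvidia-container-runtime code; the mergeEnvFromFile name is just a placeholder, and the spec type comes from the OCI runtime-spec specs-go module):

    package ocihook

    import (
        "bufio"
        "os"
        "strings"

        specs "github.com/opencontainers/runtime-spec/specs-go"
    )

    // mergeEnvFromFile looks for NVIDIA_ENV_FROM_FILE in the container's
    // environment; if it is set, the referenced file is read line by line
    // and each KEY=VALUE line is appended to the process environment.
    func mergeEnvFromFile(spec *specs.Spec) error {
        if spec.Process == nil {
            return nil
        }

        var envFile string
        for _, kv := range spec.Process.Env {
            if strings.HasPrefix(kv, "NVIDIA_ENV_FROM_FILE=") {
                envFile = strings.TrimPrefix(kv, "NVIDIA_ENV_FROM_FILE=")
                break
            }
        }
        if envFile == "" {
            return nil // nothing to do for containers without the variable
        }

        f, err := os.Open(envFile)
        if err != nil {
            return err
        }
        defer f.Close()

        scanner := bufio.NewScanner(f)
        for scanner.Scan() {
            line := strings.TrimSpace(scanner.Text())
            if line == "" || !strings.Contains(line, "=") {
                continue // skip blank or malformed lines
            }
            // e.g. NVIDIA_VISIBLE_DEVICES=0,1
            spec.Process.Env = append(spec.Process.Env, line)
        }
        return scanner.Err()
    }

With something like this, the runtime would only act when NVIDIA_ENV_FROM_FILE is present, so containers without it would be unaffected.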

2. Scenario

In order to support multiple pods sharing a GPU card in Kubernetes, we implemented a GPU sharing scheduler (https://github.com/AliyunContainerService/gpushare-scheduler-extender) and a device plugin (https://github.com/AliyunContainerService/gpushare-device-plugin).

Now, we have a new design scheme for gpushare-device-plugin, which is described as follows:

  • a pod requests some GPU memory (eg: 5GiB).
  • the gpushare scheduler allocates a GPU card of a node to this pod and writes the result to the pod's annotations, eg:

    apiVersion: v1
    kind: Pod
    metadata:
      annotations:
        gpushare.aliyun.com/gpu-mem: gpu0  # the scheduler allocated 5GiB of gpu0 to the pod

  • the pod is scheduled to a node.
  • kubelet invokes the Allocate function of the gpushare-device-plugin; this function converts the device ids passed by kubelet into a value for the environment variable NVIDIA_VISIBLE_DEVICES. But in this phase the gpushare-device-plugin cannot detect which pod these device ids were assigned to by kubelet, so we cannot read the annotations written by the gpushare scheduler for that pod and cannot set the environment variable NVIDIA_VISIBLE_DEVICES.

In order to solve this problem, we designed the execution logic of the Allocate function as follows (a sketch of this logic appears after the list):

  • suppose the device id list passed by kubelet is fake0, fake1, fake2.
  • make a hash of this device id list and take the first n characters of the hash value, for example: abcdefg.
  • set the environment variable NVIDIA_ENV_FROM_FILE=/run/abcdefg.txt
  • return from the function

[image: execution logic of the Allocate function]
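A rough sketch of this Allocate logic, assuming the standard Kubernetes device plugin API (k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1); the GPUSharePlugin type, the 8-character hash length, and the file naming are only placeholders:

    package gpushare

    import (
        "context"
        "crypto/sha256"
        "encoding/hex"
        "fmt"
        "sort"
        "strings"

        pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
    )

    type GPUSharePlugin struct{}

    // hashDeviceIDs derives a short, stable identifier from the fake device
    // ids passed by kubelet (eg: fake0, fake1, fake2 -> "abcdefg...").
    func hashDeviceIDs(ids []string) string {
        sorted := append([]string(nil), ids...)
        sort.Strings(sorted)
        sum := sha256.Sum256([]byte(strings.Join(sorted, ",")))
        return hex.EncodeToString(sum[:])[:8] // first n characters of the hash
    }

    // Allocate only records where the real environment variables will later
    // be written; the owning pod cannot be resolved at this point, so no
    // NVIDIA_VISIBLE_DEVICES value is set here.
    func (p *GPUSharePlugin) Allocate(ctx context.Context, reqs *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
        resp := &pluginapi.AllocateResponse{}
        for _, req := range reqs.ContainerRequests {
            envFile := fmt.Sprintf("/run/%s.txt", hashDeviceIDs(req.DevicesIDs))
            resp.ContainerResponses = append(resp.ContainerResponses, &pluginapi.ContainerAllocateResponse{
                Envs: map[string]string{"NVIDIA_ENV_FROM_FILE": envFile},
            })
        }
        return resp, nil
    }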

At the same time, we will implement the following logic in the PreStartContainer function of the gpushare device plugin:

[image: execution logic of the PreStartContainer function]
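One plausible shape for this PreStartContainer logic, continuing the Allocate sketch above (same package and imports, plus "os"; resolveVisibleDevices is a hypothetical helper standing in for the pod/annotation lookup that returns the real GPU indices to expose):

    // PreStartContainer runs after Allocate but before the container starts,
    // so the pod that owns these device ids can now be resolved (eg: via the
    // gpushare.aliyun.com/gpu-mem annotation written by the scheduler).
    func (p *GPUSharePlugin) PreStartContainer(ctx context.Context, req *pluginapi.PreStartContainerRequest) (*pluginapi.PreStartContainerResponse, error) {
        // The same hash as in Allocate identifies the file the runtime will read.
        envFile := fmt.Sprintf("/run/%s.txt", hashDeviceIDs(req.DevicesIDs))

        visibleDevices, err := p.resolveVisibleDevices(ctx, req.DevicesIDs) // hypothetical helper
        if err != nil {
            return nil, err
        }

        contents := fmt.Sprintf("NVIDIA_VISIBLE_DEVICES=%s\nNVIDIA_DRIVER_CAPABILITIES=compute\n", visibleDevices)
        if err := os.WriteFile(envFile, []byte(contents), 0o644); err != nil {
            return nil, err
        }
        return &pluginapi.PreStartContainerResponse{}, nil
    }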

Finally, when nvidia-container-runtime modifies the OCI spec of the container and finds that the container has the environment variable NVIDIA_ENV_FROM_FILE, it should read the environment variables from the file /run/abcdefg.txt and add them to the OCI Spec of the container.

Can someone give me an answer?

elezar commented

Note that our architecture has changed significantly since this issue was created. We are moving towards making richer edits to the incoming OCI Specification using CDI as a means for vendors such as NVIDIA to define the edits required.

Furthermore, in the context of Kubernetes something like Dynamic Resource Allocation (see the example driver) may be more applicable to the use cases that you are proposing. There is also work in progress for a DRA driver for NVIDIA GPUs.