NVIDIA/nvidia-container-runtime

Support reading NVIDIA_VISIBLE_DEVICES from a file and updating it in the container's OCI Spec

happy2048 opened this issue · 2 comments

Hi, container toolkit team:

1. Requirement

The nvidia-container-runtime is used to add the following part to the OCI Spec file of the container:

    "hooks": {
        "prestart": [
            {
                "path": "/usr/bin/nvidia-container-runtime-hook",
                "args": [
                    "/usr/bin/nvidia-container-runtime-hook",
                    "prestart"
                ]
            }
        ]
    },
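For reference, a rough sketch of how a runtime shim could inject this hook (this is only an illustration, not the actual nvidia-container-runtime source; the package and function names are placeholders, and the spec types come from the OCI runtime-spec specs-go module):

    package ocihook

    import (
        "encoding/json"
        "os"
        "path/filepath"

        specs "github.com/opencontainers/runtime-spec/specs-go"
    )

    // addPrestartHook loads <bundle>/config.json, appends the
    // nvidia-container-runtime-hook prestart entry shown above, and writes
    // the spec back for the real runtime (runc) to consume.
    func addPrestartHook(bundleDir string) error {
        configPath := filepath.Join(bundleDir, "config.json")

        data, err := os.ReadFile(configPath)
        if err != nil {
            return err
        }
        var spec specs.Spec
        if err := json.Unmarshal(data, &spec); err != nil {
            return err
        }

        if spec.Hooks == nil {
            spec.Hooks = &specs.Hooks{}
        }
        spec.Hooks.Prestart = append(spec.Hooks.Prestart, specs.Hook{
            Path: "/usr/bin/nvidia-container-runtime-hook",
            Args: []string{"/usr/bin/nvidia-container-runtime-hook", "prestart"},
        })

        out, err := json.Marshal(&spec)
        if err != nil {
            return err
        }
        return os.WriteFile(configPath, out, 0o644)
    }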

Can it also implement the following feature?

  • I have a file (eg: /run/nvidia_env.txt) to which some environment variables are written, eg:

    $ cat /run/nvidia_env.txt
    NVIDIA_VISIBLE_DEVICES=0,1
    NVIDIA_DRIVER_CAPABILITIES=compute
    USER_CUSTOM_ENV=hello

  • If the container has been given a specific environment variable (eg: NVIDIA_ENV_FROM_FILE=/run/nvidia_env.txt), could the nvidia-container-runtime read the file /run/nvidia_env.txt and add the variables it contains to the OCI Spec of the container? (A sketch of this behaviour follows the list.)
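For illustration, a rough sketch of what I mean (not existing nvidia-container-runtime code; the mergeEnvFromFile name is just a placeholder, and the spec type comes from the OCI runtime-spec specs-go module):

    package ocihook

    import (
        "bufio"
        "os"
        "strings"

        specs "github.com/opencontainers/runtime-spec/specs-go"
    )

    // mergeEnvFromFile looks for NVIDIA_ENV_FROM_FILE in the container's
    // environment; if it is set, the referenced file is read line by line
    // and each KEY=VALUE line is appended to the process environment.
    func mergeEnvFromFile(spec *specs.Spec) error {
        if spec.Process == nil {
            return nil
        }

        var envFile string
        for _, kv := range spec.Process.Env {
            if strings.HasPrefix(kv, "NVIDIA_ENV_FROM_FILE=") {
                envFile = strings.TrimPrefix(kv, "NVIDIA_ENV_FROM_FILE=")
                break
            }
        }
        if envFile == "" {
            return nil // nothing to do for containers without the variable
        }

        f, err := os.Open(envFile)
        if err != nil {
            return err
        }
        defer f.Close()

        scanner := bufio.NewScanner(f)
        for scanner.Scan() {
            line := strings.TrimSpace(scanner.Text())
            if line == "" || !strings.Contains(line, "=") {
                continue // skip blank or malformed lines
            }
            // e.g. NVIDIA_VISIBLE_DEVICES=0,1
            spec.Process.Env = append(spec.Process.Env, line)
        }
        return scanner.Err()
    }

With something like this, the runtime would only act when NVIDIA_ENV_FROM_FILE is present, so containers without it would be unaffected.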

2. Scenario

In order to support multiple pods sharing a GPU card in Kubernetes, we implemented a GPU sharing scheduler (https://github.com/AliyunContainerService/gpushare-scheduler-extender) and a device plugin (https://github.com/AliyunContainerService/gpushare-device-plugin).

Now, we have a new design scheme for gpushare-device-plugin, which is described as follows:

  • a pod requests some GPU memory (eg: 5GiB).
  • the gpushare scheduler allocates a GPU card of a node to this pod and writes the result to the pod's annotations, eg:

    apiVersion: v1
    kind: Pod
    metadata:
      annotations:
        gpushare.aliyun.com/gpu-mem: gpu0  # the scheduler allocated 5GiB of gpu0 to the pod

  • the pod is scheduled to a node.
  • kubelet invokes the Allocate function of the gpushare-device-plugin; this function converts the device ids passed by kubelet into a value for the environment variable NVIDIA_VISIBLE_DEVICES. But in this phase the gpushare-device-plugin cannot detect which pod these device ids were assigned to by kubelet, so we cannot read the annotations written by the gpushare scheduler for that pod and cannot set the environment variable NVIDIA_VISIBLE_DEVICES.

In order to solve this problem, we designed the execution logic of the Allocate function as follows (a sketch of this logic appears after the list):

  • suppose the device id list passed by kubelet is fake0, fake1, fake2.
  • make a hash of this device id list and take the first n characters of the hash value, for example: abcdefg.
  • set the environment variable NVIDIA_ENV_FROM_FILE=/run/abcdefg.txt
  • return from the function

[image: execution logic of the Allocate function]
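A rough sketch of this Allocate logic, assuming the standard Kubernetes device plugin API (k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1); the GPUSharePlugin type, the 8-character hash length, and the file naming are only placeholders:

    package gpushare

    import (
        "context"
        "crypto/sha256"
        "encoding/hex"
        "fmt"
        "sort"
        "strings"

        pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
    )

    type GPUSharePlugin struct{}

    // hashDeviceIDs derives a short, stable identifier from the fake device
    // ids passed by kubelet (eg: fake0, fake1, fake2 -> "abcdefg...").
    func hashDeviceIDs(ids []string) string {
        sorted := append([]string(nil), ids...)
        sort.Strings(sorted)
        sum := sha256.Sum256([]byte(strings.Join(sorted, ",")))
        return hex.EncodeToString(sum[:])[:8] // first n characters of the hash
    }

    // Allocate only records where the real environment variables will later
    // be written; the owning pod cannot be resolved at this point, so no
    // NVIDIA_VISIBLE_DEVICES value is set here.
    func (p *GPUSharePlugin) Allocate(ctx context.Context, reqs *pluginapi.AllocateRequest) (*pluginapi.AllocateResponse, error) {
        resp := &pluginapi.AllocateResponse{}
        for _, req := range reqs.ContainerRequests {
            envFile := fmt.Sprintf("/run/%s.txt", hashDeviceIDs(req.DevicesIDs))
            resp.ContainerResponses = append(resp.ContainerResponses, &pluginapi.ContainerAllocateResponse{
                Envs: map[string]string{"NVIDIA_ENV_FROM_FILE": envFile},
            })
        }
        return resp, nil
    }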

At the same time, we will implement the following logic in the PreStartContainer function of the gpushare device plugin:

[image: execution logic of the PreStartContainer function]
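One plausible shape for this PreStartContainer logic, continuing the Allocate sketch above (same package and imports, plus "os"; resolveVisibleDevices is a hypothetical helper standing in for the pod/annotation lookup that returns the real GPU indices to expose):

    // PreStartContainer runs after Allocate but before the container starts,
    // so the pod that owns these device ids can now be resolved (eg: via the
    // gpushare.aliyun.com/gpu-mem annotation written by the scheduler).
    func (p *GPUSharePlugin) PreStartContainer(ctx context.Context, req *pluginapi.PreStartContainerRequest) (*pluginapi.PreStartContainerResponse, error) {
        // The same hash as in Allocate identifies the file the runtime will read.
        envFile := fmt.Sprintf("/run/%s.txt", hashDeviceIDs(req.DevicesIDs))

        visibleDevices, err := p.resolveVisibleDevices(ctx, req.DevicesIDs) // hypothetical helper
        if err != nil {
            return nil, err
        }

        contents := fmt.Sprintf("NVIDIA_VISIBLE_DEVICES=%s\nNVIDIA_DRIVER_CAPABILITIES=compute\n", visibleDevices)
        if err := os.WriteFile(envFile, []byte(contents), 0o644); err != nil {
            return nil, err
        }
        return &pluginapi.PreStartContainerResponse{}, nil
    }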

Finally, when nvidia-container-runtime modifies the OCI spec of the container and finds that the container has the environment variable NVIDIA_ENV_FROM_FILE, it should read the environment variables from the file /run/abcdefg.txt and add them to the OCI Spec of the container.

Can someone give me an answer?

elezar commented

Note that our architecture has changed significantly since this issue was created. We are moving towards making richer edits to the incoming OCI Specification using CDI as a means for vendors such as NVIDIA to define the edits required.

Furthermore, in the context of Kubernetes something like Dynamic Resource Allocation (see the example driver) may be more applicable to the use cases that you are proposing. There is also work in progress for a DRA driver for NVIDIA GPUs.