intel/intel-technology-enabling-for-openshift

P1-Blocker: GPU workloads cannot access GPU devices from the container environment without setsebool container_use_devices on

vbedida79 opened this issue · 6 comments

Updated according to @mregmi's and @vbedida79's comments.

Summary

GPU workloads cannot access GPU devices from the container environment without setsebool container_use_devices on.

Detail

GPU workload pods requesting the gpu.intel.com/i915 resource cannot run until they have access to /dev/drm on the GPU node.
This access can be granted by running setsebool container_use_devices on on the host node, but that does not scale when a cluster has multiple GPU nodes, since the permission has to be set on each node manually.
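
For illustration, a minimal pod of the affected kind might look like the following (a sketch only; the pod name and image are placeholders, not from this issue). Without the workaround below, such a pod fails when clinfo tries to open the render device under /dev/dri.

$ cat <<EOF | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-clinfo-test
spec:
  restartPolicy: Never
  containers:
  - name: clinfo
    image: registry.example.com/opencl-tools:latest
    command: ["clinfo"]
    resources:
      limits:
        gpu.intel.com/i915: 1
EOF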

Root cause

The /dev/drm access permission has not been added to the container_device_t policy, so access to /dev/drm is blocked by SELinux. As a result, the workload application cannot access the GPU device node files from the container environment.
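
To confirm on a given node that SELinux is what blocks the access, the audit log can be checked for denials against dri_device_t and the boolean state inspected (illustrative commands run on the node; output will vary):

$ ausearch -m avc -ts recent | grep dri_device_t
$ getsebool container_use_devices
container_use_devices --> off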

Solution

  • Work with container-selinux upstream to add the needed permission, and make sure the new container-selinux release containing the fix gets merged into an OCP release.
  • Until it is merged into an OCP release, distribute the new policy through the user-container-policy project.
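
Pending the upstream change, the rough shape of such a policy can be sketched as a small CIL module that lets container processes use DRI device nodes (an illustration under assumptions only; the actual rules carried by container-selinux or the user-container-policy project may differ):

$ cat <<EOF > container-dri.cil
; hypothetical module: allow container_t to use DRI device nodes
(allow container_t dri_device_t (chr_file (getattr open read write ioctl map)))
EOF
$ semodule -i container-dri.cil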

Workaround

To ensure all GPU workloads (clinfo, AI inference) work properly, run the following steps on the GPU nodes.

  1. Find all nodes with an Intel Data Center GPU card using the following command:
$ oc get nodes -l intel.feature.node.kubernetes.io/gpu=true

Example output:

NAME         STATUS   ROLES    AGE   VERSION
icx-dgpu-1   Ready    worker   30d   v1.25.4+18eadca
  2. Navigate to the node terminal on the web console (Compute -> Nodes -> Select a node -> Terminal) and run the following commands in the terminal. Repeat step 2 for any other nodes with an Intel Data Center GPU card. A verification step is shown after these commands.
$ chroot /host
$ setsebool container_use_devices on
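
To verify the change took effect, and optionally make it persist across node reboots (setsebool without -P only changes the running policy), the following can also be run in the same terminal:

$ getsebool container_use_devices
container_use_devices --> on
$ setsebool -P container_use_devices on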

@vbedida79 which milestone is this targeted for? Please add the milestone. Thanks!

The solution with the container_device_t option fails with:
type=AVC msg=audit(1693240354.718:400): avc: denied { getattr } for pid=1634424 comm="ls" path="/dev/sda1" dev="devtmpfs" ino=65545 scontext=system_u:system_r:container_device_t:s0:c17,c28 tcontext=system_u:object_r:fixed_disk_device_t:s0 tclass=blk_file permissive=0
type=AVC msg=audit(1693240354.718:401): avc: denied { getattr } for pid=1634424 comm="ls" path="/dev/sda" dev="devtmpfs" ino=65544 scontext=system_u:system_r:container_device_t:s0:c17,c28 tcontext=system_u:object_r:fixed_disk_device_t:s0 tclass=blk_file permissive=0
type=AVC msg=audit(1693240354.718:402): avc: denied { getattr } for pid=1634424 comm="ls" path="/dev/log" dev="devtmpfs" ino=22631 scontext=system_u:system_r:container_device_t:s0:c17,c28 tcontext=system_u:object_r:devlog_t:s0 tclass=lnk_file permissive=0
type=AVC msg=audit(1693240473.700:403): avc: denied { map } for pid=1637667 comm="clinfo" path="/dev/dri/renderD128" dev="devtmpfs" ino=360931 scontext=system_u:system_r:container_device_t:s0:c17,c28 tcontext=system_u:object_r:dri_device_t:s0 tclass=chr_file permissive=0
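
For reference, denials like these can be fed to audit2allow to see exactly which permissions the active policy is missing (illustrative only; the generated rules should be reviewed rather than installed blindly, and the module name is a placeholder):

$ ausearch -m avc -ts recent | audit2allow -m container_device_access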

We need to check for the right permissions or missing options in the custom SCC. For v1.0.1, the workaround in PR #104 needs to be followed.

mregmi commented

The device plugin gives the workload access to the devices it requested, so the workload can run without a custom policy or permissions. But on the node, SELinux blocks the access anyway because it has no concept of containers; that is why the container_use_devices boolean was designed. So for plugin workloads to have access to the devices the plugin exported, the node must have container_use_devices set to on (it is off by default). The alternative is to use a custom SELinux label and SCC.
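
As an illustration of that alternative (a sketch under assumptions: it needs an SCC whose seLinuxContext strategy permits the label, and as noted earlier in this issue the container_device_t label still hit denials at the time), a pod can request a specific SELinux type through its security context:

$ cat <<EOF | oc apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-clinfo-custom-label
spec:
  restartPolicy: Never
  securityContext:
    seLinuxOptions:
      type: container_device_t
  containers:
  - name: clinfo
    image: registry.example.com/opencl-tools:latest
    command: ["clinfo"]
    resources:
      limits:
        gpu.intel.com/i915: 1
EOF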

mregmi commented

@rhatdan We have a question about how SELinux labels translate between device files volume-mounted into a pod and the same files on the node. We have a workload pod that requests a resource from a device plugin, and as a result that device is shared with the workload. The device file is visible to the pod as container_file_t and on the node it is device_t. The pod can access the file without container_use_devices being set to on; the translation works in this case.
But in another case (GPU), the device file visible to the pod is container_file_t and on the node it is dri_device_t. In this case SELinux blocks the access and we see the following error.

Is it a bug that GPU devices (dri_device_t) are treated differently?

type=AVC msg=audit(1694192866.178:142): avc: denied { read write } for pid=2741395 comm="clinfo" name="renderD128" dev="devtmpfs" ino=124588 scontext=system_u:system_r:container_t:s0:c17,c28 tcontext=system_u:object_r:dri_device_t:s0 tclass=chr_file permissive=0

Thanks.
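
For reference, the label difference described above can be inspected directly with ls -Z on the node and inside the pod (illustrative commands; the node output shown is an example and the pod name is a placeholder):

$ ls -Z /dev/dri/renderD128
system_u:object_r:dri_device_t:s0 /dev/dri/renderD128
$ oc exec gpu-clinfo-test -- ls -Z /dev/dri/renderD128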

mregmi commented

We are currently testing a fix from Dan. containers/container-selinux#268
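
Once an updated container-selinux carrying that fix is on the nodes, one way to sanity-check it (an assumption-based sketch; the exact package version depends on when the PR lands) is to confirm the package, leave the boolean at its default, re-run a GPU workload, and watch for new denials:

$ rpm -q container-selinux
$ getsebool container_use_devices
$ ausearch -m avc -ts recent | grep dri_device_t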