Unable to access GPU when VirtualGL is installed
j3soon opened this issue · 5 comments
The following error occurs when trying to access the GPU inside a Docker container as a non-root user on a system with VirtualGL installed:
$ nvidia-smi
Failed to initialize NVML: Insufficient Permissions
This was reported by @YuZhong-Chen on August 9th and reproduced by @ClassLongJoe1112 and @j3soon.
This is because VirtualGL, by default, changes the group ownership of the GPU devices to vglusers. For example:
$ ls -l /dev | grep nvidia
drwxr-xr-x 2 root root 80 Aug 4 19:26 nvidia-caps
crw-rw---- 1 root vglusers 195, 254 Aug 4 19:26 nvidia-modeset
crw-rw-rw- 1 root root 510, 0 Aug 4 19:26 nvidia-uvm
crw-rw-rw- 1 root root 510, 1 Aug 4 19:26 nvidia-uvm-tools
crw-rw---- 1 root vglusers 195, 0 Aug 4 19:26 nvidia0
crw-rw---- 1 root vglusers 195, 1 Aug 4 19:26 nvidia1
crw-rw---- 1 root vglusers 195, 255 Aug 4 19:26 nvidiactl
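To check whether a given machine is affected, something like the sketch below may help. It assumes a stat that supports -c (GNU coreutils or BusyBox); the helper names device_group and in_group are made up for illustration.

```shell
# Print the name of the group that owns a file or device node.
# Assumes a stat that supports -c (GNU coreutils or BusyBox).
device_group() {
    stat -c '%G' "$1"
}

# Succeed if the current user is a member of the given group.
in_group() {
    id -nG | tr ' ' '\n' | grep -qx "$1"
}

# Example usage on an affected host (not executed here):
#   device_group /dev/nvidiactl
#   in_group vglusers || echo "current user is not in vglusers"
```

If device_group reports vglusers and in_group fails for the user running inside the container, the NVML permission error above is the expected result.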
To resolve this, there are two potential solutions.
- Ask the server admin/IT to re-configure VirtualGL to not change the group ownership, in one of two ways:
  - Run /opt/VirtualGL/bin/vglserver_config and unconfigure VirtualGL (ref)
  - Run /opt/VirtualGL/bin/vglserver_config and re-configure VirtualGL without changing the group ownership by setting No (n) for the following two options (ref):
    Restrict 3D X server access to vglusers group (recommended)?
    Restrict framebuffer device access to vglusers group (recommended)?
- Modify the Dockerfile to add the default user to the vglusers group. However, since the vglusers group may have a different Group ID (GID) on different machines, this approach cannot be made portable.
I believe the first solution is the best option, since the second solution is not portable across different machines.
However, it is worth noting that the first solution requires all users on the server to be trusted. If the server has untrusted users, the first solution may introduce security risks, and you may prefer the second solution/workaround.
Hi @j3soon,
Please append the following argument to the docker run command when running a container on a VirtualGL-installed system:
--group-add $(getent group vglusers | cut -d: -f3)
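As a rough sketch, the argument can be wrapped in a small helper that prints nothing when the group does not exist, so the same command line also works on hosts without VirtualGL. The function name group_add_arg and the image placeholder are only illustrative, and getent is assumed to be available on the host.

```shell
# Print "--group-add <GID>" if the named group exists on the host,
# and print nothing otherwise.
group_add_arg() {
    # getent exits non-zero when the group is missing; emit no argument then.
    gaa_entry="$(getent group "$1")" || return 0
    gaa_gid="$(printf '%s\n' "$gaa_entry" | cut -d: -f3)"
    printf '%s %s\n' --group-add "$gaa_gid"
}

# Example usage (<image> is a placeholder, not executed here):
#   docker run --rm --gpus all $(group_add_arg vglusers) <image> nvidia-smi
```

The command substitution is left unquoted on purpose so that it expands to two separate words, or to nothing at all.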
Regards,
Kuan-Yu
Hi @KuanYuChang,
Thanks for sharing this. I hadn’t thought of adding a group when launching the container, which gives the container access to the GPU without modifying the Dockerfile.
Hi @YuZhong-Chen,
I think we can keep the Dockerfile intact and use a hardcoded group ID in the compose.yaml on that specific machine for now. See these docs for adding the vglusers GID.
I'm thinking of a portable way to support this, which may be achieved through the following shell command:
(getent group vglusers || echo user:x:1000) | cut -d: -f3
which outputs the vglusers GID if the group exists, and outputs 1000 otherwise.
Ref: https://stackoverflow.com/a/69987399
However, docker compose files don't seem to allow shell script expansion. We may need a wrapper around docker compose up to achieve this, which may be overkill since VirtualGL is not installed on most systems.
Ref: docker/compose#4081
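For reference, such a wrapper might look like the sketch below. It relies on Compose substituting environment variables into compose.yaml; the variable name VGLUSERS_GID, the function names, and the service placeholder are assumptions for illustration and not part of this repository.

```shell
# Print the GID of a group, or the given fallback GID when the group is absent.
resolve_gid() {
    (getent group "$1" || echo "fallback:x:$2") | cut -d: -f3
}

# Hypothetical wrapper around `docker compose up`. It passes the resolved GID
# to Compose through the environment; compose.yaml would reference it, e.g.:
#   services:
#     <service>:
#       group_add:
#         - "${VGLUSERS_GID}"
compose_up() {
    VGLUSERS_GID="$(resolve_gid vglusers 1000)" docker compose up "$@"
}

# Example usage (not executed here):
#   compose_up -d
```

On hosts without VirtualGL this adds GID 1000 as a supplementary group, which matches the fallback in the command above.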
I think we can leave this to the users when they're using machines with VirtualGL, and ask them to add the hardcoded GID in the docker compose files.
We can come back to this issue later if someone comes up with a portable solution for docker compose. Thanks!
Self note:
By default, the devcontainer updates the container user's UID and GID to match the local user, which avoids permission problems with bind mounts. After hardcoding the group ID in the compose.yaml file, if you want to use the devcontainer to launch the container, remember to add "updateRemoteUserUID": false, to the devcontainer.json file to prevent the devcontainer from updating your UID and GID. Ref