[6.1] Cannot pass through GPU - VM hangs
Closed this issue · 2 comments
Required information
- Distribution: Ubuntu
- Distribution version: 24.04
- The output of "snap list --all lxd core20 core22 core24 snapd":
cis-juju snap list --all lxd core20 core22 core24 snapd
Name Version Rev Tracking Publisher Notes
core20 20240705 2379 latest/stable canonical✓ base,disabled
core20 20240911 2434 latest/stable canonical✓ base
core22 20240904 1621 latest/stable canonical✓ base,disabled
core22 20241001 1663 latest/stable canonical✓ base
core24 20240710 490 latest/stable canonical✓ base,disabled
core24 20240920 609 latest/stable canonical✓ base
lxd 6.1-efad198 29943 latest/stable canonical✓ disabled
lxd 6.1-78a3d8f 30130 latest/stable canonical✓ -
snapd 2.62 21465 latest/stable canonical✓ snapd,disabled
snapd 2.63 21759 latest/stable canonical✓ snapd -
The output of "lxc info" or if that fails:
- Kernel version: 6.8.0-48-generic
- LXC version: 6.1-78a3d8f
- LXD version: 6.1-78a3d8f
- Storage backend in use: zfs
Issue description
When a GPU is passed through, the VM hangs indefinitely on start and produces no output.
Steps to reproduce
- Blacklist the nvidia module
- Create VM: lxc init --vm ubuntu:jammy gpu-test --network lxdbr0 --storage nvme -c limits.cpu=4 -c limits.memory=8GiB
- Attach GPU: lxc config device add gpu-test nvidia-gpu gpu pci=01:00.0
- Attempt to start the VM
Checking with lxc list or lxc exec shows that the VM is not responsive.
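The first step does not say how the module was blacklisted. A minimal sketch of one common approach follows; the helper names, the module list, and the file name `/etc/modprobe.d/blacklist-nvidia.conf` are assumptions for illustration and are not taken from the report.

```shell
#!/bin/sh
# Print modprobe.d lines that keep the host from loading GPU drivers.
# The module list is an assumption; compare it with `lsmod` on the host.
nvidia_blacklist_conf() {
    for mod in nvidia nvidia_drm nvidia_modeset nvidia_uvm nouveau; do
        printf 'blacklist %s\n' "$mod"
    done
}

# Expand a short PCI address (01:00.0) to its sysfs device path.
# Addresses that already carry a domain (0000:01:00.0) are kept as-is.
pci_sysfs_path() {
    case "$1" in
        ????:??:??.?) addr=$1 ;;
        *) addr="0000:$1" ;;
    esac
    printf '/sys/bus/pci/devices/%s\n' "$addr"
}

# Usage on the host (needs root, so it is not run here):
#   nvidia_blacklist_conf | sudo tee /etc/modprobe.d/blacklist-nvidia.conf
#   sudo update-initramfs -u && sudo reboot
#   readlink "$(pci_sysfs_path 01:00.0)/driver"   # driver bound to the GPU, if any
```

After the reboot, the `readlink` check should no longer point at an `nvidia` driver before the GPU device is attached to the VM.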
Information to attach
- Any relevant kernel output (dmesg)
- Container log (lxc info NAME --show-log)
- Container configuration (lxc config show NAME --expanded)
- Main daemon log (at /var/log/lxd/lxd.log or /var/snap/lxd/common/lxd/logs/lxd.log)
- Output of the client with --debug
- Output of the daemon with --debug (alternatively output of lxc monitor while reproducing the issue)
- lspci
If I run this exercise in another environment, I get a different error:
```
lxc start gpu-test
Error: Failed to start device "nvidia-gpu": Failed to override IOMMU group driver: Device took too long to activate at "/sys/bus/pci/drivers/vfio-pci/0000:21:00.0"
Try `lxc info --show-log gpu-test` for more info
130 ubuntu@epyc-cpu:~$ lspci -vvnn | less
ubuntu@epyc-cpu:~$ lxc info --show-log gpu-test
Name: gpu-test
Status: STOPPED
Type: virtual-machine
Architecture: x86_64
Created: 2024/11/01 15:02 UTC
Error: open /var/snap/lxd/common/lxd/logs/gpu-test/qemu.log: no such file or directory
```
In the end, it was a firmware configuration issue. In my case, the GPU has 24 GB of memory, so passing it through requires increasing the 64-bit MMIO size. See https://edk2.groups.io/g/discuss/topic/ovmf_resource_assignment/59340711 for details.
The fix was adding the following QEMU option to the VM configuration: raw.qemu: -fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=65536
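The value 65536 is given in MiB, so it asks the firmware for a 64 GiB aperture. The sketch below shows one way to arrive at that number and to build the option string. PCI BARs are sized in powers of two, so a 24 GB card would plausibly need a 32 GiB BAR; doubling that for headroom is an assumption that happens to reproduce the value above, not a documented rule. The function names are hypothetical.

```shell
#!/bin/sh
# Suggest a 64-bit MMIO aperture in MiB for a GPU with the given memory in GiB.
# Rule of thumb (assumption): round up to a power of two, then double it.
mmio64_mb() {
    gib=$1
    pow=1
    while [ "$pow" -lt "$gib" ]; do
        pow=$((pow * 2))
    done
    echo $((pow * 2 * 1024))
}

# Build the QEMU argument that hands the aperture size to OVMF via fw_cfg.
raw_qemu_arg() {
    printf '%s%s\n' '-fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=' "$(mmio64_mb "$1")"
}

# Usage on the LXD host (not run here), for the 24 GB GPU in this report:
#   lxc config set gpu-test raw.qemu="$(raw_qemu_arg 24)"
#   lxc start gpu-test
```

Setting `raw.qemu` this way replaces any existing value for that key, so other raw QEMU arguments already configured on the instance would need to be included in the same string.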