canonical/lxd

[6.1] Cannot pass through GPU - VM hangs

Closed this issue · 2 comments

Required information

  • Distribution: Ubuntu

  • Distribution version: 24.04

  • The output of "snap list --all lxd core20 core22 core24 snapd":
    cis-juju snap list --all lxd core20 core22 core24 snapd
    Name Version Rev Tracking Publisher Notes
    core20 20240705 2379 latest/stable canonical✓ base,disabled
    core20 20240911 2434 latest/stable canonical✓ base
    core22 20240904 1621 latest/stable canonical✓ base,disabled
    core22 20241001 1663 latest/stable canonical✓ base
    core24 20240710 490 latest/stable canonical✓ base,disabled
    core24 20240920 609 latest/stable canonical✓ base
    lxd 6.1-efad198 29943 latest/stable canonical✓ disabled
    lxd 6.1-78a3d8f 30130 latest/stable canonical✓ -
    snapd 2.62 21465 latest/stable canonical✓ snapd,disabled
    snapd 2.63 21759 latest/stable canonical✓ snapd

  • The output of "lxc info" or if that fails:

    • Kernel version: 6.8.0-48-generic
    • LXC version: 6.1-78a3d8f
    • LXD version: 6.1-78a3d8f
    • Storage backend in use: zfs

Issue description

When I attempt to pass a GPU through to a VM, the VM hangs indefinitely on start and produces no output.

Steps to reproduce

  1. Blacklist the nvidia module
  2. Create VM: lxc init --vm ubuntu:jammy gpu-test --network lxdbr0 --storage nvme -c limits.cpu=4 -c limits.memory=8GiB
  3. Attach GPU: lxc config device add gpu-test nvidia-gpu gpu pci=01:00.0
  4. Attempt to start the VM: lxc start gpu-test

Checking with lxc list or lxc exec shows that the VM is unresponsive.
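
Step 1 does not say how the module was blacklisted. A minimal sketch, assuming the usual modprobe.d mechanism on Ubuntu; the helper name, the file name blacklist-nvidia.conf, and the exact module list (nvidia, its companion modules, and nouveau) are assumptions, not taken from the report:

```shell
# Print modprobe.d "blacklist" lines for the given kernel modules.
blacklist_lines() {
  for mod in "$@"; do
    printf 'blacklist %s\n' "$mod"
  done
}

# Assumed module list; adjust to whatever `lsmod | grep -i nvidia` shows on the host.
NVIDIA_MODULES="nvidia nvidia_drm nvidia_modeset nvidia_uvm nouveau"

# Preview the file contents without touching the system.
blacklist_lines $NVIDIA_MODULES

# To apply on the host (needs root, then a reboot):
#   blacklist_lines $NVIDIA_MODULES | sudo tee /etc/modprobe.d/blacklist-nvidia.conf
#   sudo update-initramfs -u
```

The script only prints the lines; the commented commands at the end are what would actually write the file and rebuild the initramfs.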

Information to attach

  • Any relevant kernel output (dmesg)
  • Container log (lxc info NAME --show-log)
  • Container configuration (lxc config show NAME --expanded)
  • Main daemon log (at /var/log/lxd/lxd.log or /var/snap/lxd/common/lxd/logs/lxd.log)
  • Output of the client with --debug
  • Output of the daemon with --debug (alternatively output of lxc monitor while reproducing the issue)
  • lspci

debug.lxd.log

If I try to run this exercise in another environment, I get a different error:

lxc start gpu-test
Error: Failed to start device "nvidia-gpu": Failed to override IOMMU group driver: Device took too long to activate at "/sys/bus/pci/drivers/vfio-pci/0000:21:00.0"
Try `lxc info --show-log gpu-test` for more info
130 ubuntu@epyc-cpu:~$ lspci -vvnn | less
ubuntu@epyc-cpu:~$ lxc info --show-log gpu-test
Name: gpu-test
Status: STOPPED
Type: virtual-machine
Architecture: x86_64
Created: 2024/11/01 15:02 UTC
Error: open /var/snap/lxd/common/lxd/logs/gpu-test/qemu.log: no such file or directory
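
The "Failed to override IOMMU group driver" message suggests the device did not end up bound to vfio-pci in time. One way to see which driver currently holds the device, and which IOMMU group it belongs to, is to read sysfs. A minimal sketch; the helper names and the optional sysfs-root argument are hypothetical, and the PCI address is the one from the error message:

```shell
# Print the driver bound to a PCI device, or "none" if nothing is bound.
# $1 = full PCI address (e.g. 0000:21:00.0), $2 = sysfs root (defaults to /sys).
pci_driver() {
  link="${2:-/sys}/bus/pci/devices/$1/driver"
  if [ -L "$link" ]; then
    basename "$(readlink "$link")"
  else
    echo none
  fi
}

# Print the IOMMU group number of a PCI device, or "none" if it has no group
# (for example when the IOMMU is disabled).
pci_iommu_group() {
  link="${2:-/sys}/bus/pci/devices/$1/iommu_group"
  if [ -L "$link" ]; then
    basename "$(readlink "$link")"
  else
    echo none
  fi
}

echo "driver: $(pci_driver 0000:21:00.0)"
echo "iommu group: $(pci_iommu_group 0000:21:00.0)"
```

If the driver reported is still nvidia or nouveau rather than vfio-pci or none, the host driver is probably still holding the device.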

In the end, it was a firmware configuration issue. My GPU has 24 GB of memory, so passing it through requires increasing the 64-bit PCI MMIO aperture that the guest firmware (OVMF) allocates. See https://edk2.groups.io/g/discuss/topic/ovmf_resource_assignment/59340711 for details.

The fix was to add this QEMU option to the VM's config: raw.qemu: -fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=65536
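
The value 65536 is in MiB, i.e. a 64 GiB aperture. A minimal sketch of setting it with lxc config set, plus one way to arrive at such a number. The sizing rule (round the VRAM size up to a power of two, then double it) is a rule of thumb assumed here, not something stated in this issue or the linked thread, and the helper name and the APPLY variable are hypothetical:

```shell
# Hypothetical helper: suggest a 64-bit MMIO aperture size in MiB for a GPU
# with the given VRAM size in GiB. Rule of thumb (assumption): round the VRAM
# size up to the next power of two, then double it for headroom.
mmio64_mb() {
  want_mb=$(( $1 * 1024 ))
  size_mb=1
  while [ "$size_mb" -lt "$want_mb" ]; do
    size_mb=$(( size_mb * 2 ))
  done
  echo $(( size_mb * 2 ))
}

INSTANCE=gpu-test
RAW_QEMU="-fw_cfg name=opt/ovmf/X-PciMmio64Mb,string=$(mmio64_mb 24)"
echo "raw.qemu: $RAW_QEMU"

# Only touch the instance when explicitly asked to (APPLY=1) and lxc exists.
# The setting takes effect the next time the VM is started.
if [ "${APPLY:-0}" = 1 ] && command -v lxc >/dev/null 2>&1; then
  lxc config set "$INSTANCE" raw.qemu="$RAW_QEMU"
fi
```

For a 24 GiB card this rule happens to give the same 65536 used above. Run with APPLY=1 to actually write the setting; otherwise the script only prints it.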