NVIDIA/gpu-operator

The GPU Operator driver build fails on GCP when using Ubuntu 22.04.

uniit opened this issue · 21 comments

uniit commented

1. Quick Debug Information

  • OS/Version: Ubuntu 22.04
  • Kernel Version:
Linux version 5.19.0-1030-gcp (buildd@bos03-amd64-050) (x86_64-linux-gnu-gcc-12 (Ubuntu 12.1.0-2ubuntu1~22.04) 12.1.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #32~22.04.1-Ubuntu SMP Thu Jul 13 09:36:23 UTC 2023
  • GPU Operator Version: GPU Operator 23.3.2 Release

2. Issue or feature description

The issue is that the GPU Operator cannot build the NVIDIA drivers on GCP with Ubuntu 22.04: the GCP kernel was built with the x86_64-linux-gnu-gcc-12 compiler, while the driver container compiles with GCC 11. This compiler mismatch causes the build to fail.

warning: the compiler differs from the one used to build the kernel
  The kernel was built by: x86_64-linux-gnu-gcc-12 (Ubuntu 12.1.0-2ubuntu1~22.04) 12.1.0
  You are using:           cc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0

Ubuntu 22.04 on AWS works because its kernel was built with a compiler whose major version is compatible with the one used for the generic kernels, so the GPU Operator can build the NVIDIA drivers successfully there.

Linux version 5.15.0-1034-aws (buildd@lcy02-amd64-114) (gcc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, GNU ld (GNU Binutils for Ubuntu) 2.34) #38~20.04.1-Ubuntu SMP Wed Mar 29 19:48:16 UTC 2023
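The mismatch can be spotted mechanically from the two version strings in the build warning above. A small sketch, using the strings copied from this issue as samples:

```shell
# Compare the major version of the kernel's builder compiler with the
# container's cc, using the two strings from the build warning above.
kernel_cc='x86_64-linux-gnu-gcc-12 (Ubuntu 12.1.0-2ubuntu1~22.04) 12.1.0'
container_cc='cc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0'
kmajor=$(printf '%s\n' "$kernel_cc" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n1 | cut -d. -f1)
cmajor=$(printf '%s\n' "$container_cc" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n1 | cut -d. -f1)
if [ "$kmajor" != "$cmajor" ]; then
  echo "compiler mismatch: kernel=gcc-$kmajor container=gcc-$cmajor"
fi
```

On the GCP strings this prints a mismatch (12 vs 11); on the AWS node above the container's GCC 11 is new enough for a kernel built with GCC 9, so the build succeeds.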

3. Steps to reproduce the issue

  1. Create GCP instance (N1 + NVIDIA T4) with Ubuntu 22.04.
  2. Install k3s:
curl -sfL https://get.k3s.io | sh - 

  3. Install the GPU Operator with the following Helm values:

USER-SUPPLIED VALUES:
driver:
  enabled: true
operator:
  cleanupCRD: true
  defaultRuntime: containerd
toolkit:
  env:
  - name: CONTAINERD_CONFIG
    value: /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
  - name: CONTAINERD_SOCKET
    value: /run/k3s/containerd/containerd.sock
  - name: CONTAINERD_RUNTIME_CLASS
    value: nvidia
  - name: CONTAINERD_SET_AS_DEFAULT
    value: "true"
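The values above can then be applied with a Helm install along these lines (the release/namespace names are assumptions based on the rest of this thread, and the command needs a live cluster, so it is only assembled and printed here):

```shell
# Sketch: install the GPU Operator with the user-supplied values saved as
# values.yaml. Release name and namespace are assumptions; since this needs
# a running cluster, the command is only printed, not executed.
cmd='helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update && helm upgrade --install gpu-operator nvidia/gpu-operator -n nvidia --create-namespace -f values.yaml'
echo "$cmd"
```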

4. Question:

Can I easily include gcc-12 in the driver image and change the build instructions to utilize it, either through an environment variable or by overriding the initial command?

Is there a plan to introduce support for Ubuntu 22.04 on GCP?
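For what it's worth, one way to include gcc-12 without waiting for an upstream fix might be a thin derived image; this mirrors the update-alternatives approach suggested later in the thread. A sketch (the base tag follows the image versions mentioned in this issue; untested):

```dockerfile
# Sketch: derive from the published driver image and switch the default gcc.
# Base tag and package availability are assumptions; untested.
FROM nvcr.io/nvidia/driver:525.105.17-ubuntu22.04

RUN apt-get update && \
    apt-get install -y --no-install-recommends gcc-12 && \
    update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 12 && \
    rm -rf /var/lib/apt/lists/*
```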

Hi @uniit can you provide complete logs from the driver container?

uniit commented

Sure, please review it below:

k logs nvidia-driver-daemonset-d6jsz -n nvidia -f
Defaulted container "nvidia-driver-ctr" out of: nvidia-driver-ctr, k8s-driver-manager (init)
DRIVER_ARCH is x86_64
Creating directory NVIDIA-Linux-x86_64-525.105.17
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 525.105.17........................

WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.


WARNING: You specified the '--no-kernel-modules' command line option, nvidia-installer will not install any kernel modules as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation.  Please ensure that NVIDIA kernel modules matching this driver version are installed separately.


WARNING: This NVIDIA driver package includes Vulkan components, but no Vulkan ICD loader was detected on this system. The NVIDIA Vulkan ICD will not function without the loader. Most distributions package the Vulkan loader; try installing the "vulkan-loader", "vulkan-icd-loader", or "libvulkan1" package.


========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 525.105.17 for Linux kernel version 5.19.0-1030-gcp

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
Resolving Linux kernel version...
Proceeding with Linux kernel version 5.19.0-1030-gcp
Installing Linux kernel headers...
Installing Linux kernel module files...
Generating Linux kernel version string...
Compiling NVIDIA driver kernel modules...
warning: the compiler differs from the one used to build the kernel
  The kernel was built by: x86_64-linux-gnu-gcc-12 (Ubuntu 12.1.0-2ubuntu1~22.04) 12.1.0
  You are using:           cc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'
cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'
make[2]: *** [scripts/Makefile.build:257: /usr/src/nvidia-525.105.17/kernel/nvidia/nv-acpi.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[2]: *** [scripts/Makefile.build:257: /usr/src/nvidia-525.105.17/kernel/nvidia/nv-dmabuf.o] Error 1
cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'
make[2]: *** [scripts/Makefile.build:257: /usr/src/nvidia-525.105.17/kernel/nvidia/nv-pci.o] Error 1
cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'
make[2]: *** [scripts/Makefile.build:257: /usr/src/nvidia-525.105.17/kernel/nvidia/nv-nano-timer.o] Error 1
cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'
make[2]: *** [scripts/Makefile.build:257: /usr/src/nvidia-525.105.17/kernel/nvidia/nv-dma.o] Error 1
cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'
make[2]: *** [scripts/Makefile.build:257: /usr/src/nvidia-525.105.17/kernel/nvidia/nv-cray.o] Error 1
cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'
make[2]: *** [scripts/Makefile.build:257: /usr/src/nvidia-525.105.17/kernel/nvidia/nv.o] Error 1
cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'
make[2]: *** [scripts/Makefile.build:257: /usr/src/nvidia-525.105.17/kernel/nvidia/nv-p2p.o] Error 1
cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'
cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'
make[2]: *** [scripts/Makefile.build:257: /usr/src/nvidia-525.105.17/kernel/nvidia/nv-mmap.o] Error 1
make[2]: *** [scripts/Makefile.build:257: /usr/src/nvidia-525.105.17/kernel/nvidia/nv-i2c.o] Error 1
cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'
make[2]: *** [scripts/Makefile.build:257: /usr/src/nvidia-525.105.17/kernel/nvidia/nv-pat.o] Error 1
cc: error: unrecognized command-line option '-ftrivial-auto-var-init=zero'
make[2]: *** [scripts/Makefile.build:257: /usr/src/nvidia-525.105.17/kernel/nvidia/nv-procfs.o] Error 1
make[1]: *** [Makefile:1857: /usr/src/nvidia-525.105.17/kernel] Error 2
make: *** [Makefile:82: modules] Error 2
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...

@uniit can you try with driver version 525.125.06? It should use an updated CUDA base image that comes with the GCC version shown below. It looks like GCC 12 is the minimum version required to support this compiler flag.

ii gcc-12-base:amd64 12.1.0-2ubuntu1~22.04 amd64 GCC, the GNU Compiler Collection (base package)

uniit commented

@shivamerla I've tried nvcr.io/nvidia/driver:525.125.06-ubuntu22.04. Looks like gcc-12 is still missing there. Errors are the same.

@uniit you are right, we install the build-essential meta package, which pulls in GCC 11.x by default, and that version doesn't support the options the kernel was built with. I can think of the following options in this case:

  1. Build an image with pre-compiled modules for the -gcp kernel from the precompiled folder, using the steps here: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/precompiled-drivers.html#building-a-custom-driver-container-image

  2. Install GCC 12.x and overwrite what is installed by the build-essential meta package here: https://gitlab.com/nvidia/container-images/driver/-/blame/main/ubuntu22.04/install.sh#L10

apt-get install gcc-12
update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-12 12

@shivamerla Any news on this?

Shouldn't this be managed directly within driver containers rather than by clients?


nice bro you are right!

I'm cross-posting here a bit, but I'm having this same issue, although I'm deploying via Cloud Native Stack (one of the Ansible playbooks) on Ubuntu 22.04. I'm unsure if upgrading my CNS version will solve this, based on what's been said above and that the relevant install.sh/Dockerfile look about the same. Hopefully I'm missing something.

Any feedback/insight would be appreciated. Thank you.

@BHSDuncan @xzzvsxd we are looking to fix this soon as this is affecting all customers using 6.x kernels.

Thank you for the update. Is there any timeline for when we can expect this fix?

Also, is it safe to assume that the only real workaround when using Cloud Native Stack (i.e. being unable to easily rebuild images) is to roll back to a 5.x kernel? E.g. 5.19.0

Yes, you can check the GCC version used by referring to /proc/version on the host. With GCC 11.x, builds are working fine.
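For example, the builder's GCC major version can be pulled out of a /proc/version-style line like this (the sample line is copied from the report above; on a live host you would read /proc/version instead):

```shell
# Extract the major version of the gcc that built the kernel from a
# /proc/version-style string. Sample copied from this issue; on a real
# node, substitute: line=$(cat /proc/version)
line='Linux version 5.19.0-1030-gcp (buildd@bos03-amd64-050) (x86_64-linux-gnu-gcc-12 (Ubuntu 12.1.0-2ubuntu1~22.04) 12.1.0, GNU ld (GNU Binutils for Ubuntu) 2.38)'
major=$(printf '%s\n' "$line" | grep -oE 'gcc-?[0-9]+' | head -n1 | grep -oE '[0-9]+$')
echo "kernel built with gcc major version: $major"
```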

It's the driver daemonset pod that's failing, specifically the nvidia-driver-ctr container, and I can't change how it's built (i.e. I can't tell it which version of GCC to download/install since this was all installed via a Cloud Native Stack Ansible playbook).

Changing the version of GCC on the host doesn't fix anything. For example, I did try installing GCC 12 on a host, but the problem still happens when starting the driver daemonset pod.

@BHSDuncan You must change the version of GCC that was used to build the kernel running on the host, so roll back to a 5.x kernel.

Changing the version of GCC within the container should work, but from your previous messages I assume you can't.

Yeah, because I'm using CNS, I can't do much of anything other than rely on their images.

Hi @shivamerla, I see gcc-12 is already installed here https://gitlab.com/nvidia/container-images/driver/-/blob/main/ubuntu22.04/Dockerfile#L97-101 and set as the default via update-alternatives. I'm just curious whether the published image is built from this Dockerfile or not?

I built a driver image myself from the latest code (with the fix https://gitlab.com/nvidia/container-images/driver/-/commit/dd69782dc6a21aa92ded68fb9db58bd4b1a23a4a), which works around the issue temporarily:
docker build -t mydriver --build-arg DRIVER_VERSION="550.54.14" --build-arg DRIVER_BRANCH="550" --build-arg CUDA_VERSION=12.4.0 --build-arg TARGETARCH=amd64 ubuntu22.04

Generating Linux kernel version string...
Compiling NVIDIA driver kernel modules...
warning: the compiler differs from the one used to build the kernel
  The kernel was built by: x86_64-linux-gnu-gcc-12 (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0
  You are using:           cc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0
Skipping BTF generation for /usr/src/nvidia-550.54.14/kernel/nvidia-peermem.ko due to unavailability of vmlinux
Skipping BTF generation for /usr/src/nvidia-550.54.14/kernel/nvidia-modeset.ko due to unavailability of vmlinux
Skipping BTF generation for /usr/src/nvidia-550.54.14/kernel/nvidia-drm.ko due to unavailability of vmlinux
Skipping BTF generation for /usr/src/nvidia-550.54.14/kernel/nvidia-uvm.ko due to unavailability of vmlinux
Skipping BTF generation for /usr/src/nvidia-550.54.14/kernel/nvidia.ko due to unavailability of vmlinux
Relinking NVIDIA driver kernel modules...

@wyike Thanks for testing the CI pipeline build and confirming that it works. We will have this fix out when the next Data Center GPU Driver is released by the driver team. We (the gpu-operator team) build the driver containers off of the driver releases managed by the driver team.

See here to get more info on the Data Center drivers: https://docs.nvidia.com/datacenter/tesla/

Hi @tariq1890, sorry to ask a very junior question: why does the driver installer have to be built on the cuda-base image at https://gitlab.com/nvidia/container-images/driver/-/blob/main/ubuntu22.04/Dockerfile#L2? Is anything in this image used by the installer when installing to the host?
I would like to ask this somewhere like a community Slack channel but couldn't find one. Could you answer the question, or point me to somewhere public where I could ask general questions? Thanks a lot!

I managed to work around this by forcing Ubuntu to boot with kernel 5.15.0-79; that fixed the issue and I was able to get the driver installed via the daemonset.

With this fix, can the new driver installer still be used on previous kernel versions?

dr3s commented

For those coming across this because of GKE auto-upgrades: we were able to pin the node group version to 1.27.11-gke.1062003, because 1.27.11-gke.1062004 introduced a new kernel which triggers this issue.