awslabs/amazon-eks-ami

libnvidia-ml library not found in GPU AMIs

Closed this issue Β· 10 comments

What happened:
Since last week we can't run any GPU workload on the latest GPU AMI. We first noticed something was wrong when the nvidia-device-plugin pod reported this error: "Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory". Debugging on the node itself shows a lot of libnvidia-related libraries reported as missing. Detailed logs are below.
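For anyone hitting the same symptom, a quick way to tell whether the driver libraries are actually missing from the host (as opposed to just not being injected into containers) is to check the node directly. A minimal sketch, assuming the stock AL2 GPU AMI library layout:

# Run on the affected node (not inside a pod).
ls -l /usr/lib64/libnvidia-ml.so*       # should list libnvidia-ml.so.1 plus the versioned driver library
ldconfig -p | grep -i libnvidia-ml      # should show the library in the dynamic linker cache
lsmod | grep -i nvidia                  # confirms the kernel modules are loaded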

What you expected to happen:
Be able to run GPU workloads with the default GPU AMI.

How to reproduce it (as minimally and precisely as possible):
nvidia-device-plugin daemonset output:

I0402 12:47:47.247247 1 main.go:279] Retrieving plugins.
W0402 12:47:47.247300 1 factory.go:31] No valid resources detected, creating a null CDI handler
I0402 12:47:47.247350 1 factory.go:104] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0402 12:47:47.247391 1 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0402 12:47:47.247405 1 factory.go:112] Incompatible platform detected
E0402 12:47:47.247409 1 factory.go:113] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0402 12:47:47.247414 1 factory.go:114] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0402 12:47:47.247417 1 factory.go:115] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0402 12:47:47.247422 1 factory.go:116] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I0402 12:47:47.247432 1 main.go:308] No devices found. Waiting indefinitely.
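(The log above is from the device plugin pod. Assuming the upstream daemonset manifest deployed in kube-system, it can be pulled with something like the command below; the daemonset name depends on how the plugin was installed.)

kubectl -n kube-system logs daemonset/nvidia-device-plugin-daemonset --tail=50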

nvidia-smi output:

sh-4.2$ sudo nvidia-smi
Tue Apr 2 12:52:53 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03 Driver Version: 535.129.03 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla M60 On | 00000000:00:1D.0 Off | 0 |
| N/A 27C P8 14W / 150W | 0MiB / 7680MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 Tesla M60 On | 00000000:00:1E.0 Off | 0 |
| N/A 32C P8 14W / 150W | 0MiB / 7680MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+

nvidia-container-cli debug output:

sudo nvidia-container-cli -k -d /dev/tty info

-- WARNING, the following logs are for debugging purposes only --

I0402 12:53:01.959334 12668 nvc.c:367] initializing library context (version=1.4.0)
I0402 12:53:01.959387 12668 nvc.c:341] using root /
I0402 12:53:01.959402 12668 nvc.c:342] using ldcache /etc/ld.so.cache
I0402 12:53:01.959407 12668 nvc.c:343] using unprivileged user 65534:65534
I0402 12:53:01.959426 12668 nvc.c:384] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0402 12:53:01.959519 12668 nvc.c:386] dxcore initialization failed, continuing assuming a non-WSL environment
I0402 12:53:01.970498 12669 nvc.c:269] loading kernel module nvidia
I0402 12:53:01.970661 12669 nvc.c:273] running mknod for /dev/nvidiactl
I0402 12:53:01.970705 12669 nvc.c:277] running mknod for /dev/nvidia0
I0402 12:53:01.970735 12669 nvc.c:277] running mknod for /dev/nvidia1
I0402 12:53:01.970765 12669 nvc.c:281] running mknod for all nvcaps in /dev/nvidia-caps
I0402 12:53:01.977123 12669 nvc.c:209] running mknod for /dev/nvidia-caps/nvidia-cap1 from /proc/driver/nvidia/capabilities/mig/config
I0402 12:53:01.977243 12669 nvc.c:209] running mknod for /dev/nvidia-caps/nvidia-cap2 from /proc/driver/nvidia/capabilities/mig/monitor
I0402 12:53:01.979624 12669 nvc.c:287] loading kernel module nvidia_uvm
I0402 12:53:01.979680 12669 nvc.c:291] running mknod for /dev/nvidia-uvm
I0402 12:53:01.979763 12669 nvc.c:296] loading kernel module nvidia_modeset
I0402 12:53:01.979809 12669 nvc.c:300] running mknod for /dev/nvidia-modeset
I0402 12:53:01.980097 12670 driver.c:101] starting driver service
I0402 12:53:02.008978 12668 nvc_info.c:676] requesting driver information with ''
I0402 12:53:02.010682 12668 nvc_info.c:169] selecting /usr/lib64/libnvoptix.so.535.129.03
I0402 12:53:02.011468 12668 nvc_info.c:169] selecting /usr/lib64/libnvidia-tls.so.535.129.03
I0402 12:53:02.011859 12668 nvc_info.c:169] selecting /usr/lib64/libnvidia-rtcore.so.535.129.03
I0402 12:53:02.012252 12668 nvc_info.c:169] selecting /usr/lib64/libnvidia-ptxjitcompiler.so.535.129.03
I0402 12:53:02.012847 12668 nvc_info.c:169] selecting /usr/lib64/libnvidia-opticalflow.so.535.129.03
I0402 12:53:02.013352 12668 nvc_info.c:169] selecting /usr/lib64/libnvidia-opencl.so.535.129.03
I0402 12:53:02.014197 12668 nvc_info.c:169] selecting /usr/lib64/libnvidia-ngx.so.535.129.03
I0402 12:53:02.014238 12668 nvc_info.c:169] selecting /usr/lib64/libnvidia-ml.so.535.129.03
I0402 12:53:02.014708 12668 nvc_info.c:169] selecting /usr/lib64/libnvidia-glvkspirv.so.535.129.03
I0402 12:53:02.015077 12668 nvc_info.c:169] selecting /usr/lib64/libnvidia-glsi.so.535.129.03
I0402 12:53:02.015790 12668 nvc_info.c:169] selecting /usr/lib64/libnvidia-glcore.so.535.129.03
I0402 12:53:02.016238 12668 nvc_info.c:169] selecting /usr/lib64/libnvidia-fbc.so.535.129.03
I0402 12:53:02.016692 12668 nvc_info.c:169] selecting /usr/lib64/libnvidia-encode.so.535.129.03
I0402 12:53:02.017148 12668 nvc_info.c:169] selecting /usr/lib64/libnvidia-eglcore.so.535.129.03
I0402 12:53:02.017629 12668 nvc_info.c:169] selecting /usr/lib64/libnvidia-cfg.so.535.129.03
I0402 12:53:02.018059 12668 nvc_info.c:169] selecting /usr/lib64/libnvidia-allocator.so.535.129.03
I0402 12:53:02.018517 12668 nvc_info.c:169] selecting /usr/lib64/libnvcuvid.so.535.129.03
I0402 12:53:02.018704 12668 nvc_info.c:169] selecting /usr/lib64/libcuda.so.535.129.03
I0402 12:53:02.019626 12668 nvc_info.c:169] selecting /usr/lib64/libGLX_nvidia.so.535.129.03
I0402 12:53:02.020084 12668 nvc_info.c:169] selecting /usr/lib64/libGLESv2_nvidia.so.535.129.03
I0402 12:53:02.021057 12668 nvc_info.c:169] selecting /usr/lib64/libGLESv1_CM_nvidia.so.535.129.03
I0402 12:53:02.021495 12668 nvc_info.c:169] selecting /usr/lib64/libEGL_nvidia.so.535.129.03
W0402 12:53:02.021527 12668 nvc_info.c:350] missing library libnvidia-nscq.so
W0402 12:53:02.021540 12668 nvc_info.c:350] missing library libnvidia-fatbinaryloader.so
W0402 12:53:02.021546 12668 nvc_info.c:350] missing library libnvidia-compiler.so
W0402 12:53:02.021557 12668 nvc_info.c:350] missing library libvdpau_nvidia.so
W0402 12:53:02.021569 12668 nvc_info.c:350] missing library libnvidia-ifr.so
W0402 12:53:02.021574 12668 nvc_info.c:350] missing library libnvidia-cbl.so
W0402 12:53:02.021583 12668 nvc_info.c:354] missing compat32 library libnvidia-ml.so
W0402 12:53:02.021594 12668 nvc_info.c:354] missing compat32 library libnvidia-cfg.so
W0402 12:53:02.021607 12668 nvc_info.c:354] missing compat32 library libnvidia-nscq.so
W0402 12:53:02.021612 12668 nvc_info.c:354] missing compat32 library libcuda.so
W0402 12:53:02.021622 12668 nvc_info.c:354] missing compat32 library libnvidia-opencl.so
W0402 12:53:02.021632 12668 nvc_info.c:354] missing compat32 library libnvidia-ptxjitcompiler.so
W0402 12:53:02.021638 12668 nvc_info.c:354] missing compat32 library libnvidia-fatbinaryloader.so
W0402 12:53:02.021645 12668 nvc_info.c:354] missing compat32 library libnvidia-allocator.so
W0402 12:53:02.021649 12668 nvc_info.c:354] missing compat32 library libnvidia-compiler.so
W0402 12:53:02.021654 12668 nvc_info.c:354] missing compat32 library libnvidia-ngx.so
W0402 12:53:02.021667 12668 nvc_info.c:354] missing compat32 library libvdpau_nvidia.so
W0402 12:53:02.021672 12668 nvc_info.c:354] missing compat32 library libnvidia-encode.so
W0402 12:53:02.021683 12668 nvc_info.c:354] missing compat32 library libnvidia-opticalflow.so
W0402 12:53:02.021688 12668 nvc_info.c:354] missing compat32 library libnvcuvid.so
W0402 12:53:02.021698 12668 nvc_info.c:354] missing compat32 library libnvidia-eglcore.so
W0402 12:53:02.021703 12668 nvc_info.c:354] missing compat32 library libnvidia-glcore.so
W0402 12:53:02.021708 12668 nvc_info.c:354] missing compat32 library libnvidia-tls.so
W0402 12:53:02.021713 12668 nvc_info.c:354] missing compat32 library libnvidia-glsi.so
W0402 12:53:02.021718 12668 nvc_info.c:354] missing compat32 library libnvidia-fbc.so
W0402 12:53:02.021724 12668 nvc_info.c:354] missing compat32 library libnvidia-ifr.so
W0402 12:53:02.021729 12668 nvc_info.c:354] missing compat32 library libnvidia-rtcore.so
W0402 12:53:02.021734 12668 nvc_info.c:354] missing compat32 library libnvoptix.so
W0402 12:53:02.021740 12668 nvc_info.c:354] missing compat32 library libGLX_nvidia.so
W0402 12:53:02.021745 12668 nvc_info.c:354] missing compat32 library libEGL_nvidia.so
W0402 12:53:02.021750 12668 nvc_info.c:354] missing compat32 library libGLESv2_nvidia.so
W0402 12:53:02.021757 12668 nvc_info.c:354] missing compat32 library libGLESv1_CM_nvidia.so
W0402 12:53:02.021762 12668 nvc_info.c:354] missing compat32 library libnvidia-glvkspirv.so
W0402 12:53:02.021767 12668 nvc_info.c:354] missing compat32 library libnvidia-cbl.so
I0402 12:53:02.021899 12668 nvc_info.c:276] selecting /usr/bin/nvidia-smi
I0402 12:53:02.021941 12668 nvc_info.c:276] selecting /usr/bin/nvidia-debugdump
I0402 12:53:02.021979 12668 nvc_info.c:276] selecting /usr/bin/nvidia-persistenced
I0402 12:53:02.022009 12668 nvc_info.c:276] selecting /usr/bin/nv-fabricmanager
I0402 12:53:02.022045 12668 nvc_info.c:276] selecting /usr/bin/nvidia-cuda-mps-control
I0402 12:53:02.022082 12668 nvc_info.c:276] selecting /usr/bin/nvidia-cuda-mps-server
I0402 12:53:02.022117 12668 nvc_info.c:438] listing device /dev/nvidiactl
I0402 12:53:02.022128 12668 nvc_info.c:438] listing device /dev/nvidia-uvm
I0402 12:53:02.022140 12668 nvc_info.c:438] listing device /dev/nvidia-uvm-tools
I0402 12:53:02.022157 12668 nvc_info.c:438] listing device /dev/nvidia-modeset
W0402 12:53:02.022185 12668 nvc_info.c:321] missing ipc /var/run/nvidia-persistenced/socket
W0402 12:53:02.022219 12668 nvc_info.c:321] missing ipc /var/run/nvidia-fabricmanager/socket
W0402 12:53:02.022246 12668 nvc_info.c:321] missing ipc /tmp/nvidia-mps
I0402 12:53:02.022257 12668 nvc_info.c:733] requesting device information with ''
I0402 12:53:02.028453 12668 nvc_info.c:623] listing device /dev/nvidia0 (GPU-f9d01fac-33c7-8745-3017-c4632ea8ede1 at 00000000:00:1d.0)
I0402 12:53:02.034508 12668 nvc_info.c:623] listing device /dev/nvidia1 (GPU-3fb93927-23a0-a536-f602-cdcd87227e5f at 00000000:00:1e.0)
NVRM version: 535.129.03
CUDA version: 12.2

Device Index: 0
Device Minor: 0
Model: Tesla M60
Brand: Tesla
GPU UUID: GPU-f9d01fac-33c7-8745-3017-c4632ea8ede1
Bus Location: 00000000:00:1d.0
Architecture: 5.2

Device Index: 1
Device Minor: 1
Model: Tesla M60
Brand: Tesla
GPU UUID: GPU-3fb93927-23a0-a536-f602-cdcd87227e5f
Bus Location: 00000000:00:1e.0
Architecture: 5.2
I0402 12:53:02.034569 12668 nvc.c:418] shutting down library context
I0402 12:53:02.036416 12670 driver.c:163] terminating driver service
I0402 12:53:02.037260 12668 driver.c:203] driver service terminated successfully

Anything else we need to know?:

Environment:

  • AWS Region: eu-central-1
  • Instance Type(s): all g3, g4 types
  • EKS Platform version (use aws eks describe-cluster --name <name> --query cluster.platformVersion): "eks.4"
  • Kubernetes version (use aws eks describe-cluster --name <name> --query cluster.version): "1.29"
  • AMI Version: amazon-eks-gpu-node-1.29-v20240315 (all 1.29 amis are giving same error regardless)
  • Kernel (e.g. uname -a): Linux ip-100-64-20-236.eu-central-1.compute.internal 5.10.192-183.736.amzn2.x86_64 #1 SMP Wed Sep 6 21:15:41 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
  • Release information (run cat /etc/eks/release on a node):
BASE_AMI_ID="ami-0d0f88b572c6f9459"
BUILD_TIME="Wed Jan 17 22:16:02 UTC 2024"
BUILD_KERNEL="5.10.192-183.736.amzn2.x86_64"
ARCH="x86_64"

This is affecting us as well. Karpenter recently auto-updated us to v20240329 (running EKS v1.27), and it broke all of our GPU nodes. We reverted to v20240315 and things seem to be OK now.

Lesson learned: don't let Karpenter automatically upgrade AMIs.

I think you’re running into the issue described here: #1697 (comment)

Can you check the containerd config file on one of your nodes and see if the NVIDIA runtime is set?
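For example, something like this on the node will show whether the runtime made it into the effective config (a rough check; the exact stanza varies by toolkit version):

sudo grep -i nvidia /etc/containerd/config.toml
sudo containerd config dump | grep -A 4 'runtimes.nvidia'
# On a healthy GPU node you would expect a runtime entry roughly like:
#   [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
#     runtime_type = "io.containerd.runc.v2"
#     [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
#       BinaryName = "/usr/bin/nvidia-container-runtime"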

is anything else going on in your user data before the bootstrap script executes?

> is anything else going on in your user data before the bootstrap script executes?

We use Karpenter and the user data is the default.

Hi @cartermckinnon, thanks for having a look.

I checked the issue you mentioned last week, but I didn't think my case was the same since I don't do anything custom in my setup: Karpenter launches the EC2 instance with the default latest GPU AMI, the user data installs Docker because some of my workloads need to expose the Docker host (DinD), and then the nvidia-device-plugin pod is deployed to enable GPUs for the other pods.

Here is the containerd config dump; this should be what is generated by default:

sh-4.2$ sudo containerd config dump
disabled_plugins = []
imports = ["/etc/containerd/config.toml"]
oom_score = 0
plugin_dir = ""
required_plugins = []
root = "/var/lib/containerd"
state = "/run/containerd"
temp = ""
version = 2

[cgroup]
  path = ""

[debug]
  address = ""
  format = ""
  gid = 0
  level = ""
  uid = 0

[grpc]
  address = "/run/containerd/containerd.sock"
  gid = 0
  max_recv_message_size = 16777216
  max_send_message_size = 16777216
  tcp_address = ""
  tcp_tls_ca = ""
  tcp_tls_cert = ""
  tcp_tls_key = ""
  uid = 0

[metrics]
  address = ""
  grpc_histogram = false

[plugins]

  [plugins."io.containerd.gc.v1.scheduler"]
    deletion_threshold = 0
    mutation_threshold = 100
    pause_threshold = 0.02
    schedule_delay = "0s"
    startup_delay = "100ms"

  [plugins."io.containerd.grpc.v1.cri"]
    cdi_spec_dirs = ["/etc/cdi", "/var/run/cdi"]
    device_ownership_from_security_context = false
    disable_apparmor = false
    disable_cgroup = false
    disable_hugetlb_controller = true
    disable_proc_mount = false
    disable_tcp_service = true
    drain_exec_sync_io_timeout = "0s"
    enable_cdi = false
    enable_selinux = false
    enable_tls_streaming = false
    enable_unprivileged_icmp = false
    enable_unprivileged_ports = false
    ignore_image_defined_volumes = false
    image_pull_progress_timeout = "1m0s"
    max_concurrent_downloads = 3
    max_container_log_line_size = 16384
    netns_mounts_under_state_dir = false
    restrict_oom_score_adj = false
    sandbox_image = "602401143452.dkr.ecr.eu-central-1.amazonaws.com/eks/pause:3.5"
    selinux_category_range = 1024
    stats_collect_period = 10
    stream_idle_timeout = "4h0m0s"
    stream_server_address = "127.0.0.1"
    stream_server_port = "0"
    systemd_cgroup = false
    tolerate_missing_hugetlb_controller = true
    unset_seccomp_profile = ""

    [plugins."io.containerd.grpc.v1.cri".cni]
      bin_dir = "/opt/cni/bin"
      conf_dir = "/etc/cni/net.d"
      conf_template = ""
      ip_pref = ""
      max_conf_num = 1
      setup_serially = false

    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "runc"
      disable_snapshot_annotations = true
      discard_unpacked_layers = true
      ignore_blockio_not_enabled_errors = false
      ignore_rdt_not_enabled_errors = false
      no_pivot = false
      snapshotter = "overlayfs"

      [plugins."io.containerd.grpc.v1.cri".containerd.default_runtime]
        base_runtime_spec = ""
        cni_conf_dir = ""
        cni_max_conf_num = 0
        container_annotations = []
        pod_annotations = []
        privileged_without_host_devices = false
        privileged_without_host_devices_all_devices_allowed = false
        runtime_engine = ""
        runtime_path = ""
        runtime_root = ""
        runtime_type = ""
        sandbox_mode = ""
        snapshotter = ""

        [plugins."io.containerd.grpc.v1.cri".containerd.default_runtime.options]

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          base_runtime_spec = ""
          cni_conf_dir = ""
          cni_max_conf_num = 0
          container_annotations = []
          pod_annotations = []
          privileged_without_host_devices = false
          privileged_without_host_devices_all_devices_allowed = false
          runtime_engine = ""
          runtime_path = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          sandbox_mode = ""
          snapshotter = ""

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            SystemdCgroup = true

      [plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime]
        base_runtime_spec = ""
        cni_conf_dir = ""
        cni_max_conf_num = 0
        container_annotations = []
        pod_annotations = []
        privileged_without_host_devices = false
        privileged_without_host_devices_all_devices_allowed = false
        runtime_engine = ""
        runtime_path = ""
        runtime_root = ""
        runtime_type = ""
        sandbox_mode = ""
        snapshotter = ""

        [plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime.options]

    [plugins."io.containerd.grpc.v1.cri".image_decryption]
      key_model = "node"

    [plugins."io.containerd.grpc.v1.cri".registry]
      config_path = "/etc/containerd/certs.d:/etc/docker/certs.d"

      [plugins."io.containerd.grpc.v1.cri".registry.auths]

      [plugins."io.containerd.grpc.v1.cri".registry.configs]

      [plugins."io.containerd.grpc.v1.cri".registry.headers]

      [plugins."io.containerd.grpc.v1.cri".registry.mirrors]

    [plugins."io.containerd.grpc.v1.cri".x509_key_pair_streaming]
      tls_cert_file = ""
      tls_key_file = ""

  [plugins."io.containerd.internal.v1.opt"]
    path = "/opt/containerd"

  [plugins."io.containerd.internal.v1.restart"]
    interval = "10s"

  [plugins."io.containerd.internal.v1.tracing"]
    sampling_ratio = 1.0
    service_name = "containerd"

  [plugins."io.containerd.metadata.v1.bolt"]
    content_sharing_policy = "shared"

  [plugins."io.containerd.monitor.v1.cgroups"]
    no_prometheus = false

  [plugins."io.containerd.nri.v1.nri"]
    disable = true
    disable_connections = false
    plugin_config_path = "/etc/nri/conf.d"
    plugin_path = "/opt/nri/plugins"
    plugin_registration_timeout = "5s"
    plugin_request_timeout = "2s"
    socket_path = "/var/run/nri/nri.sock"

  [plugins."io.containerd.runtime.v1.linux"]
    no_shim = false
    runtime = "runc"
    runtime_root = ""
    shim = "containerd-shim"
    shim_debug = false

  [plugins."io.containerd.runtime.v2.task"]
    platforms = ["linux/amd64"]
    sched_core = false

  [plugins."io.containerd.service.v1.diff-service"]
    default = ["walking"]

  [plugins."io.containerd.service.v1.tasks-service"]
    blockio_config_file = ""
    rdt_config_file = ""

  [plugins."io.containerd.snapshotter.v1.aufs"]
    root_path = ""

  [plugins."io.containerd.snapshotter.v1.btrfs"]
    root_path = ""

  [plugins."io.containerd.snapshotter.v1.devmapper"]
    async_remove = false
    base_image_size = ""
    discard_blocks = false
    fs_options = ""
    fs_type = ""
    pool_name = ""
    root_path = ""

  [plugins."io.containerd.snapshotter.v1.native"]
    root_path = ""

  [plugins."io.containerd.snapshotter.v1.overlayfs"]
    root_path = ""
    upperdir_label = false

  [plugins."io.containerd.snapshotter.v1.zfs"]
    root_path = ""

  [plugins."io.containerd.tracing.processor.v1.otlp"]
    endpoint = ""
    insecure = false
    protocol = ""

  [plugins."io.containerd.transfer.v1.local"]
    config_path = ""
    max_concurrent_downloads = 3
    max_concurrent_uploaded_layers = 3

    [[plugins."io.containerd.transfer.v1.local".unpack_config]]
      differ = ""
      platform = "linux/amd64"
      snapshotter = "overlayfs"

[proxy_plugins]

[stream_processors]

  [stream_processors."io.containerd.ocicrypt.decoder.v1.tar"]
    accepts = ["application/vnd.oci.image.layer.v1.tar+encrypted"]
    args = ["--decryption-keys-path", "/etc/containerd/ocicrypt/keys"]
    env = ["OCICRYPT_KEYPROVIDER_CONFIG=/etc/containerd/ocicrypt/ocicrypt_keyprovider.conf"]
    path = "ctd-decoder"
    returns = "application/vnd.oci.image.layer.v1.tar"

  [stream_processors."io.containerd.ocicrypt.decoder.v1.tar.gzip"]
    accepts = ["application/vnd.oci.image.layer.v1.tar+gzip+encrypted"]
    args = ["--decryption-keys-path", "/etc/containerd/ocicrypt/keys"]
    env = ["OCICRYPT_KEYPROVIDER_CONFIG=/etc/containerd/ocicrypt/ocicrypt_keyprovider.conf"]
    path = "ctd-decoder"
    returns = "application/vnd.oci.image.layer.v1.tar+gzip"

[timeouts]
  "io.containerd.timeout.bolt.open" = "0s"
  "io.containerd.timeout.metrics.shimstats" = "2s"
  "io.containerd.timeout.shim.cleanup" = "5s"
  "io.containerd.timeout.shim.load" = "5s"
  "io.containerd.timeout.shim.shutdown" = "3s"
  "io.containerd.timeout.task.state" = "2s"

[ttrpc]
  address = ""
  gid = 0
  uid = 0

And here is my complete user data from the EC2 console:

MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="//"

--//
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/bash
# Enable docker service
sudo yum -y install docker
sudo systemctl enable docker.service
sudo rm -rf /var/run/docker.sock
sudo systemctl start docker.service

--//
Content-Type: text/x-shellscript; charset="us-ascii"

#!/bin/bash -xe
exec > >(tee /var/log/user-data.log|logger -t user-data -s 2>/dev/console) 2>&1
/etc/eks/bootstrap.sh 'ai-dev-eu1-eks-01-private' --apiserver-endpoint 'https://eksurlhere.eks.amazonaws.com' --b64-cluster-ca 'privatekeyhere' \
--container-runtime containerd \
--dns-cluster-ip '172.20.0.10' \
--use-max-pods false \
--kubelet-extra-args '--node-labels="karpenter.sh/capacity-type=on-demand,karpenter.sh/nodepool=gpu-ondemand" --register-with-taints="nvidia.com/gpu=true:NoSchedule" --max-pods=234'
--//--

I also got a message from AWS Enterprise Support. The issue is still present, but maybe this helps anyone looking for a quick workaround for now:

I reached out to the internal team to take a deeper look, and they have confirmed that the instance type is using an incorrect NVIDIA kernel module on amazon-eks-gpu-node-1.29-v20240227. A new AMI release is planned for April 5th, 2024 and that will address the issue.

In the meantime, they ask that you please revert to the older K8s AMI version (1.27 with a kernel older than 5.10.192-183.736.amzn2). On behalf of AWS, I apologize for the inconvenience.

That internal team is us πŸ˜„ but that's referring to a different issue that has been addressed in the latest release (the incorrect module being loaded for some chipsets).

The issue described here is likely because the NVIDIA runtime is not being used by containerd, resulting in the NVIDIA shared libraries not being injected into the container's LD_LIBRARY_PATH. This seems to be due to a race condition between the bootstrap script and the background process that manages the NVIDIA bits. We're working on a fix.
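Until then, a rough manual workaround on an affected node is to re-register the runtime and restart containerd. This is only a sketch, not an official fix, and it assumes a recent NVIDIA container toolkit with nvidia-ctk is present on the node (older toolkit versions don't ship it; in that case the runtime stanza shown earlier in the thread can be added to /etc/containerd/config.toml by hand):

sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
sudo systemctl restart containerd
# Then recreate the device plugin pod so it re-probes NVML, e.g.:
# kubectl -n kube-system delete pod -l name=nvidia-device-plugin-ds   # label depends on your install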

We also ran into this while trying to test v20240329 on Kubernetes 1.28, after the upgrade to 1.28 broke our previously working GPU AMI with the GSP failures described elsewhere.

The only user data changes we have for this environment are setting up NVMe disks, setting seccomp profiles, and disabling core dumps for containerd:

sed -i 's/LimitCORE=infinity/LimitCORE=0/g' /usr/lib/systemd/system/containerd.service
systemctl daemon-reload

This is all performed before we execute the bootstrap.sh script.

I have hit the same issue. The NVIDIA GPU Operator (v23.9.1) appears to function correctly with the latest GPU AMI (amazon/amazon-eks-gpu-node-1.29-v20240329).

I resolved the problem by removing the NVIDIA Device Plugin (v0.15.0-rc.2) and relying solely on the NVIDIA GPU Operator.

Hopefully, a new patch of the EKS AMI will resolve the issue with the NVIDIA Device Plugin.
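For anyone going the same route, a typical GPU Operator install looks roughly like this (chart and repo names are the NVIDIA defaults; the driver is disabled in the chart because the EKS GPU AMI already ships it):

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.enabled=false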

Watching, as we're seeing similar errors.

This should be resolved in the latest release, v20240409. πŸ‘
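For anyone verifying on an updated node, a quick sanity check (assuming the default paths) is that containerd now knows about the NVIDIA runtime and the node advertises GPU capacity:

sudo containerd config dump | grep -i nvidia                        # the runtime stanza should now be present
kubectl describe node <gpu-node-name> | grep -i 'nvidia.com/gpu'    # capacity/allocatable should be non-zero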