NVIDIA/spark-rapids

[BUG] Failed to run mig.sh on MIG dataproc-2.1-ubuntu20

Describe the bug
I observed the following error while running mig.sh on dataproc-2.1-ubuntu20 with runtime version "2.1.72-ubuntu20" and kernel version "5.15.0-1067-gcp".

 make -f ./scripts/Makefile.modpost
   sed 's/\.ko$/\.o/' /var/lib/dkms/nvidia/495.29.05/build/modules.order | scripts/mod/modpost -m -a  -o /var/lib/dkms/nvidia/495.29.05/build/Module.symvers -e -i Module.symvers   -T -
 ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol 'rcu_read_unlock_strict'
 make[2]: *** [scripts/Makefile.modpost:133: /var/lib/dkms/nvidia/495.29.05/build/Module.symvers] Error 1

I also tried some older Dataproc runtime versions. The script works with runtime version "2.1.40-ubuntu20" and kernel version "5.15.0-1049-gcp".
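The two data points above suggest the failure depends on the kernel build. As a rough sketch (not part of mig.sh), the following Python snippet compares a kernel release string against those two reports. The function names and the three labels are hypothetical, and it assumes that kernels at or below the working build also work and that kernels at or above the failing build also fail; neither has been verified.

```python
import re

# Data points from this report. The exact build where the failure starts is unknown.
REPORTED_WORKING = "5.15.0-1049-gcp"
REPORTED_FAILING = "5.15.0-1067-gcp"


def parse_kernel(release: str) -> tuple:
    """Split a release such as '5.15.0-1067-gcp' into a tuple of integers."""
    m = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)-(\d+)-gcp", release)
    if m is None:
        raise ValueError(f"unexpected kernel release format: {release!r}")
    return tuple(int(g) for g in m.groups())


def classify(release: str) -> str:
    """Place a kernel release relative to the two reported builds."""
    version = parse_kernel(release)
    if version <= parse_kernel(REPORTED_WORKING):
        return "reported working"  # assumption: older builds behave like 1049
    if version >= parse_kernel(REPORTED_FAILING):
        return "reported failing"  # assumption: newer builds behave like 1067
    return "untested"


if __name__ == "__main__":
    # On a node, the release string would come from `uname -r`.
    print(classify("5.15.0-1067-gcp"))
```

Builds between the two reported versions come back as "untested", which marks the range that would have to be bisected to find where the failure begins.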

Steps/Code to reproduce bug

  1. Create a Dataproc cluster using MIG with an nvidia-tesla-a100 GPU and runtime version "2.1.72-ubuntu20"
  2. SSH to the GPU node
  3. Download mig.sh
  4. Run sudo bash mig.sh

Expected behavior
mig.sh runs successfully.

Environment details

  • Environment location: Dataproc, version 2.1.72-ubuntu20

pxLi commented

Thanks for the investigation!

@sameerz This is the reason why mig-on-dataproc-2.1-ubuntu20 has been failing to initialize recently.

Hello @yinqingh, I think you're using a different version of the script, /gpu/mig.sh.
Can you try /spark-rapids/mig.sh instead?

I’ll inform the repository maintainers about this inconsistency.

Edit: Created issue GoogleCloudDataproc/initialization-actions#1259

Hi @SurajAralihalli, I tried spark-rapids/mig.sh, but it still failed to install the NVIDIA driver (535.104.05) with the same error. The Dataproc runtime version is "2.1.73-ubuntu20".

 make -f ./scripts/Makefile.modpost
   sed 's/\.ko$/\.o/' /var/lib/dkms/nvidia/535.104.05/build/modules.order | scripts/mod/modpost -m -a  -o /var/lib/dkms/nvidia/535.104.05/build/Module.symvers -e -i Module.symvers   -T -
 ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol 'rcu_read_unlock_strict'
 make[2]: *** [scripts/Makefile.modpost:133: /var/lib/dkms/nvidia/535.104.05/build/Module.symvers] Error 1
 make[2]: *** Deleting file '/var/lib/dkms/nvidia/535.104.05/build/Module.symvers'
 make[1]: *** [Makefile:1829: modules] Error 2
 make[1]: Leaving directory '/usr/src/linux-headers-5.15.0-1070-gcp'
 make: *** [Makefile:82: modules] Error 2
DKMSKernelVersion: 5.15.0-1070-gcp
Date: Fri Nov  8 09:07:43 2024
Package: nvidia-dkms-535 535.104.05-0ubuntu1
PackageVersion: 535.104.05-0ubuntu1
SourcePackage: nvidia-graphics-drivers-535
Title: nvidia-dkms-535 535.104.05-0ubuntu1: nvidia kernel module failed to build
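
Both driver versions fail on the same modpost error, so it may help to extract that line automatically when comparing logs across runtime versions. Below is a minimal Python sketch that scans DKMS build output like the excerpt above for the GPL-only symbol error and the `DKMSKernelVersion` field. The function name and the returned dictionary layout are hypothetical, and the regular expressions are assumptions based only on the two log excerpts in this thread.

```python
import re

# Patterns inferred from the log excerpts in this issue; other DKMS versions
# may format these lines differently.
MODPOST_ERROR = re.compile(
    r"ERROR: modpost: GPL-incompatible module (?P<module>\S+) "
    r"uses GPL-only symbol '(?P<symbol>[^']+)'"
)
KERNEL_LINE = re.compile(r"^DKMSKernelVersion:\s*(?P<kernel>\S+)", re.MULTILINE)


def summarize_build_log(log_text: str) -> dict:
    """Collect the kernel version and any GPL-only symbol errors from a log."""
    errors = [(m["module"], m["symbol"]) for m in MODPOST_ERROR.finditer(log_text)]
    kernel = KERNEL_LINE.search(log_text)
    return {
        "kernel": kernel["kernel"] if kernel else None,
        "gpl_symbol_errors": errors,
    }


# Two lines copied from the log above, used as sample input.
sample = (
    " ERROR: modpost: GPL-incompatible module nvidia.ko uses GPL-only symbol "
    "'rcu_read_unlock_strict'\n"
    "DKMSKernelVersion: 5.15.0-1070-gcp\n"
)

if __name__ == "__main__":
    print(summarize_build_log(sample))
```

Running the same function over logs from a working and a failing runtime makes it easy to confirm whether `rcu_read_unlock_strict` is the only symbol involved.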