Mellanox/nv_peer_memory

Error occurs in `sudo dpkg -i nvidia-peer-memory-dkms_1.0-1_all.deb`

Closed this issue · 17 comments

feiga commented

@rleon

I'm trying to install this module for GPUDirect RDMA, but an error occurs when I run `sudo dpkg -i nvidia-peer-memory-dkms_1.0-1_all.deb`:

Unpacking nvidia-peer-memory-dkms (1.0-1) over (1.0-1) ...
Setting up nvidia-peer-memory-dkms (1.0-1) ...

Creating symlink /var/lib/dkms/nvidia-peer-memory/1.0/source ->
                 /usr/src/nvidia-peer-memory-1.0

DKMS: add completed.

Kernel preparation unnecessary for this kernel.  Skipping...

Building module:
cleaning build area....
make KERNELRELEASE=3.13.0-98-generic all && make DESTDIR=/var/lib/dkms/nv_peer_mem/1.0/build install....
cleaning build area....

DKMS: build completed.

nv_peer_mem:
Running module version sanity check.
 - Original module
   - No original module exists within this kernel
 - Installation
   - Installing to /lib/modules/3.13.0-98-generic/updates/dkms/

depmod....

DKMS: install completed.
modprobe: ERROR: could not insert 'nv_peer_mem': Invalid argument

Do you know what the problem is?
Thanks!

feiga commented

I'm running on Ubuntu 14.04

rleon commented

@Artemy-Mellanox @alaahl
I would bet that nv_peer_mem was built and installed for the wrong kernel.

Can we get the kernel version and dmesg output?

@feiga could you also tell us which version of CUDA you are using? I think we have a symbol versioning issue with CUDA 8.
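
For example, something like the following should collect the requested details (a minimal sketch assuming a standard Ubuntu setup with the NVIDIA driver loaded; paths may differ on your machine):

uname -r                          # running kernel version
cat /proc/driver/nvidia/version   # installed NVIDIA driver version
nvcc --version                    # CUDA toolkit version
dmesg | grep nv_peer_mem          # any module load errors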

feiga commented

Thanks! I'm using CUDA 7.5. The kernel version is 3.13.0-98-generic.

feiga commented

@alaahl @haggaie This is the dmesg output:

[ 36.849441] nvidia 0000:33:00.0: irq 253 for MSI/MSI-X
[ 366.264584] nv_peer_mem: disagrees about version of symbol nvidia_p2p_get_pages
[ 366.264587] nv_peer_mem: Unknown symbol nvidia_p2p_get_pages (err -22)
[ 366.264590] nv_peer_mem: disagrees about version of symbol nvidia_p2p_put_pages
[ 366.264591] nv_peer_mem: Unknown symbol nvidia_p2p_put_pages (err -22)
[ 366.264610] nv_peer_mem: disagrees about version of symbol ib_register_peer_memory_client
[ 366.264611] nv_peer_mem: Unknown symbol ib_register_peer_memory_client (err -22)
[ 366.264620] nv_peer_mem: disagrees about version of symbol nvidia_p2p_free_page_table
[ 366.264621] nv_peer_mem: Unknown symbol nvidia_p2p_free_page_table (err -22)
[ 372.562565] nv_tco: NV TCO WatchDog Timer Driver v0.01
[ 406.887884] nv_peer_mem: disagrees about version of symbol nvidia_p2p_get_pages
[ 406.887888] nv_peer_mem: Unknown symbol nvidia_p2p_get_pages (err -22)
[ 406.887892] nv_peer_mem: disagrees about version of symbol nvidia_p2p_put_pages
[ 406.887893] nv_peer_mem: Unknown symbol nvidia_p2p_put_pages (err -22)
[ 406.887912] nv_peer_mem: disagrees about version of symbol ib_register_peer_memory_client
[ 406.887912] nv_peer_mem: Unknown symbol ib_register_peer_memory_client (err -22)
[ 406.887920] nv_peer_mem: disagrees about version of symbol nvidia_p2p_free_page_table
[ 406.887921] nv_peer_mem: Unknown symbol nvidia_p2p_free_page_table (err -22)
[ 406.889608] init: nv_peer_mem pre-start process (2463) terminated with status 1
[ 501.183418] nv_peer_mem: disagrees about version of symbol nvidia_p2p_get_pages
[ 501.183421] nv_peer_mem: Unknown symbol nvidia_p2p_get_pages (err -22)
[ 501.183425] nv_peer_mem: disagrees about version of symbol nvidia_p2p_put_pages
[ 501.183426] nv_peer_mem: Unknown symbol nvidia_p2p_put_pages (err -22)
[ 501.183445] nv_peer_mem: disagrees about version of symbol ib_register_peer_memory_client
[ 501.183446] nv_peer_mem: Unknown symbol ib_register_peer_memory_client (err -22)
[ 501.183453] nv_peer_mem: disagrees about version of symbol nvidia_p2p_free_page_table
[ 501.183454] nv_peer_mem: Unknown symbol nvidia_p2p_free_page_table (err -22)
[ 501.184658] init: nv_peer_mem pre-start process (2558) terminated with status 1
[ 514.277683] nv_peer_mem: disagrees about version of symbol nvidia_p2p_get_pages
[ 514.277685] nv_peer_mem: Unknown symbol nvidia_p2p_get_pages (err -22)
[ 514.277689] nv_peer_mem: disagrees about version of symbol nvidia_p2p_put_pages
[ 514.277690] nv_peer_mem: Unknown symbol nvidia_p2p_put_pages (err -22)
[ 514.277707] nv_peer_mem: disagrees about version of symbol ib_register_peer_memory_client
[ 514.277707] nv_peer_mem: Unknown symbol ib_register_peer_memory_client (err -22)
[ 514.277715] nv_peer_mem: disagrees about version of symbol nvidia_p2p_free_page_table
[ 514.277716] nv_peer_mem: Unknown symbol nvidia_p2p_free_page_table (err -22)
[ 514.278976] init: nv_peer_mem pre-start process (2615) terminated with status 1
[ 1190.978791] nv_peer_mem: disagrees about version of symbol nvidia_p2p_get_pages
[ 1190.978794] nv_peer_mem: Unknown symbol nvidia_p2p_get_pages (err -22)
[ 1190.978798] nv_peer_mem: disagrees about version of symbol nvidia_p2p_put_pages
[ 1190.978799] nv_peer_mem: Unknown symbol nvidia_p2p_put_pages (err -22)
[ 1190.978821] nv_peer_mem: disagrees about version of symbol ib_register_peer_memory_client
[ 1190.978822] nv_peer_mem: Unknown symbol ib_register_peer_memory_client (err -22)
[ 1190.978833] nv_peer_mem: disagrees about version of symbol nvidia_p2p_free_page_table
[ 1190.978834] nv_peer_mem: Unknown symbol nvidia_p2p_free_page_table (err -22)
[ 1194.929216] nv_peer_mem: disagrees about version of symbol nvidia_p2p_get_pages
[ 1194.929218] nv_peer_mem: Unknown symbol nvidia_p2p_get_pages (err -22)
[ 1194.929221] nv_peer_mem: disagrees about version of symbol nvidia_p2p_put_pages
[ 1194.929222] nv_peer_mem: Unknown symbol nvidia_p2p_put_pages (err -22)
[ 1194.929229] nv_peer_mem: disagrees about version of symbol ib_register_peer_memory_client
[ 1194.929230] nv_peer_mem: Unknown symbol ib_register_peer_memory_client (err -22)
[ 1194.929237] nv_peer_mem: disagrees about version of symbol nvidia_p2p_free_page_table
[ 1194.929238] nv_peer_mem: Unknown symbol nvidia_p2p_free_page_table (err -22)
[ 1226.456803] nv_peer_mem: disagrees about version of symbol nvidia_p2p_get_pages
[ 1226.456806] nv_peer_mem: Unknown symbol nvidia_p2p_get_pages (err -22)
[ 1226.456809] nv_peer_mem: disagrees about version of symbol nvidia_p2p_put_pages
[ 1226.456810] nv_peer_mem: Unknown symbol nvidia_p2p_put_pages (err -22)
[ 1226.456819] nv_peer_mem: disagrees about version of symbol ib_register_peer_memory_client
[ 1226.456820] nv_peer_mem: Unknown symbol ib_register_peer_memory_client (err -22)
[ 1226.456827] nv_peer_mem: disagrees about version of symbol nvidia_p2p_free_page_table
[ 1226.456828] nv_peer_mem: Unknown symbol nvidia_p2p_free_page_table (err -22)
[ 1262.102732] nv_peer_mem: disagrees about version of symbol nvidia_p2p_get_pages
[ 1262.102734] nv_peer_mem: Unknown symbol nvidia_p2p_get_pages (err -22)
[ 1262.102737] nv_peer_mem: disagrees about version of symbol nvidia_p2p_put_pages
[ 1262.102738] nv_peer_mem: Unknown symbol nvidia_p2p_put_pages (err -22)
[ 1262.102745] nv_peer_mem: disagrees about version of symbol ib_register_peer_memory_client
[ 1262.102746] nv_peer_mem: Unknown symbol ib_register_peer_memory_client (err -22)
[ 1262.102753] nv_peer_mem: disagrees about version of symbol nvidia_p2p_free_page_table
[ 1262.102754] nv_peer_mem: Unknown symbol nvidia_p2p_free_page_table (err -22)
[ 4724.350481] nv_peer_mem: disagrees about version of symbol nvidia_p2p_get_pages
[ 4724.350484] nv_peer_mem: Unknown symbol nvidia_p2p_get_pages (err -22)
[ 4724.350487] nv_peer_mem: disagrees about version of symbol nvidia_p2p_put_pages
[ 4724.350488] nv_peer_mem: Unknown symbol nvidia_p2p_put_pages (err -22)
[ 4724.350512] nv_peer_mem: disagrees about version of symbol ib_register_peer_memory_client
[ 4724.350513] nv_peer_mem: Unknown symbol ib_register_peer_memory_client (err -22)
[ 4724.350522] nv_peer_mem: disagrees about version of symbol nvidia_p2p_free_page_table
[ 4724.350523] nv_peer_mem: Unknown symbol nvidia_p2p_free_page_table (err -22)
[ 5126.283390] nv_peer_mem: disagrees about version of symbol nvidia_p2p_get_pages
[ 5126.283392] nv_peer_mem: Unknown symbol nvidia_p2p_get_pages (err -22)
[ 5126.283395] nv_peer_mem: disagrees about version of symbol nvidia_p2p_put_pages
[ 5126.283396] nv_peer_mem: Unknown symbol nvidia_p2p_put_pages (err -22)
[ 5126.283403] nv_peer_mem: disagrees about version of symbol ib_register_peer_memory_client
[ 5126.283404] nv_peer_mem: Unknown symbol ib_register_peer_memory_client (err -22)
[ 5126.283412] nv_peer_mem: disagrees about version of symbol nvidia_p2p_free_page_table
[ 5126.283412] nv_peer_mem: Unknown symbol nvidia_p2p_free_page_table (err -22)
[ 5142.260533] nv_peer_mem: disagrees about version of symbol nvidia_p2p_get_pages
[ 5142.260535] nv_peer_mem: Unknown symbol nvidia_p2p_get_pages (err -22)
[ 5142.260538] nv_peer_mem: disagrees about version of symbol nvidia_p2p_put_pages
[ 5142.260539] nv_peer_mem: Unknown symbol nvidia_p2p_put_pages (err -22)
[ 5142.260545] nv_peer_mem: disagrees about version of symbol ib_register_peer_memory_client
[ 5142.260546] nv_peer_mem: Unknown symbol ib_register_peer_memory_client (err -22)
[ 5142.260553] nv_peer_mem: disagrees about version of symbol nvidia_p2p_free_page_table
[ 5142.260554] nv_peer_mem: Unknown symbol nvidia_p2p_free_page_table (err -22)
[ 5691.839644] nv_peer_mem: disagrees about version of symbol nvidia_p2p_get_pages
[ 5691.839647] nv_peer_mem: Unknown symbol nvidia_p2p_get_pages (err -22)
[ 5691.839650] nv_peer_mem: disagrees about version of symbol nvidia_p2p_put_pages
[ 5691.839651] nv_peer_mem: Unknown symbol nvidia_p2p_put_pages (err -22)
[ 5691.839674] nv_peer_mem: disagrees about version of symbol ib_register_peer_memory_client
[ 5691.839675] nv_peer_mem: Unknown symbol ib_register_peer_memory_client (err -22)
[ 5691.839685] nv_peer_mem: disagrees about version of symbol nvidia_p2p_free_page_table
[ 5691.839686] nv_peer_mem: Unknown symbol nvidia_p2p_free_page_table (err -22)
[40971.122188] perf samples too long (4833 > 2500), lowering kernel.perf_event_max_sample_rate to 50000
[41130.722132] SGI XFS with ACLs, security attributes, realtime, large block/inode numbers, no debug enabled
[41130.726453] JFS: nTxBlock = 8192, nTxLock = 65536
[41130.734010] NTFS driver 2.1.30 [Flags: R/O MODULE].
[41130.744754] QNX4 filesystem 0.2.3 registered.
[41130.749174] xor: automatically using best checksumming function:
[41130.786908] avx : 23651.000 MB/sec
[41130.854985] raid6: sse2x1 8658 MB/s
[41130.923061] raid6: sse2x2 11220 MB/s
[41130.991184] raid6: sse2x4 13236 MB/s
[41130.991185] raid6: using algorithm sse2x4 (13236 MB/s)
[41130.991186] raid6: using ssse3x2 recovery algorithm
[41131.049154] bio: create slab at 1
[41131.049530] Btrfs loaded
[41141.116452] audit_printk_skb: 75 callbacks suppressed
[41141.116455] type=1400 audit(1476975847.438:47): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/lib/cups/backend/cups-pdf" pid=32779 comm="apparmor_parser"
[41141.116461] type=1400 audit(1476975847.438:48): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/sbin/cupsd" pid=32779 comm="apparmor_parser"
[41141.116725] type=1400 audit(1476975847.438:49): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/sbin/cupsd" pid=32779 comm="apparmor_parser"
[84116.787975] python (38225): Using mlock ulimits for SHM_HUGETLB is deprecated
[84149.413189] nvidia-uvm: Loaded the UVM driver in 8 mode, major device number 247
[296888.462935] perf samples too long (5128 > 5000), lowering kernel.perf_event_max_sample_rate to 25000
[300664.520252] perf samples too long (10700 > 10000), lowering kernel.perf_event_max_sample_rate to 12500
[302354.408965] perf samples too long (21294 > 20000), lowering kernel.perf_event_max_sample_rate to 6250
[308362.741565] perf samples too long (40374 > 40000), lowering kernel.perf_event_max_sample_rate to 3250
[345491.484531] nv_peer_mem: disagrees about version of symbol nvidia_p2p_get_pages
[345491.484534] nv_peer_mem: Unknown symbol nvidia_p2p_get_pages (err -22)
[345491.484538] nv_peer_mem: disagrees about version of symbol nvidia_p2p_put_pages
[345491.484539] nv_peer_mem: Unknown symbol nvidia_p2p_put_pages (err -22)
[345491.484560] nv_peer_mem: disagrees about version of symbol ib_register_peer_memory_client
[345491.484561] nv_peer_mem: Unknown symbol ib_register_peer_memory_client (err -22)
[345491.484570] nv_peer_mem: disagrees about version of symbol nvidia_p2p_free_page_table
[345491.484571] nv_peer_mem: Unknown symbol nvidia_p2p_free_page_table (err -22)
[345491.486416] init: nv_peer_mem pre-start process (28047) terminated with status 1
[345501.386899] nv_peer_mem: disagrees about version of symbol nvidia_p2p_get_pages
[345501.386904] nv_peer_mem: Unknown symbol nvidia_p2p_get_pages (err -22)
[345501.386908] nv_peer_mem: disagrees about version of symbol nvidia_p2p_put_pages
[345501.386909] nv_peer_mem: Unknown symbol nvidia_p2p_put_pages (err -22)
[345501.386928] nv_peer_mem: disagrees about version of symbol ib_register_peer_memory_client
[345501.386929] nv_peer_mem: Unknown symbol ib_register_peer_memory_client (err -22)
[345501.386939] nv_peer_mem: disagrees about version of symbol nvidia_p2p_free_page_table
[345501.386940] nv_peer_mem: Unknown symbol nvidia_p2p_free_page_table (err -22)

Yes, this issue is because we ship our own symbol version file for the CUDA driver instead of using what is actually installed. We need to either find the symbol file from the CUDA driver build on the system being set up, or use symbol_get as was done in drossetti/nv_peer_memory@ea51a48.
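
If you want to confirm the mismatch yourself, something along these lines should work (assuming a kernel built with CONFIG_MODVERSIONS, an uncompressed nvidia.ko, and that the module path below matches where DKMS installed nv_peer_mem on your system):

# CRC that nv_peer_mem was built against
modprobe --dump-modversions /lib/modules/$(uname -r)/updates/dkms/nv_peer_mem.ko | grep nvidia_p2p_get_pages
# CRC the installed NVIDIA driver actually exports
nm $(modinfo -n nvidia) | grep __crc_nvidia_p2p_get_pages
# if the two values differ, modprobe rejects the module with "Invalid argument"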

feiga commented

Finally it works. Thanks a lot!

Hi @feiga, I met the same problem while installing the GPUDirect RDMA drivers for CNTK in a Docker container. Could you share how you solved that issue?

@ferasd Has this been fixed in 1.0-5?

I encountered another error when running `sudo dpkg -i nvidia-peer-memory-dkms_1.0-5_all.deb`:

(Reading database ... 197084 files and directories currently installed.)
Preparing to unpack nvidia-peer-memory-dkms_1.0-5_all.deb ...

------------------------------
Deleting module version: 1.0
completely from the DKMS tree.
------------------------------
Done.
Unpacking nvidia-peer-memory-dkms (1.0-5) over (1.0-5) ...
Setting up nvidia-peer-memory-dkms (1.0-5) ...

Creating symlink /var/lib/dkms/nvidia-peer-memory/1.0/source ->
                 /usr/src/nvidia-peer-memory-1.0

DKMS: add completed.

Kernel preparation unnecessary for this kernel.  Skipping...

Building module:
cleaning build area....
make KERNELRELEASE=4.10.0-37-generic all KVER=4.10.0-37-generic KDIR=/lib/modules/4.10.0-37-generic/build....(bad exit status: 2)
Error! Bad return status for module build on kernel: 4.10.0-37-generic (x86_64)
Consult /var/lib/dkms/nvidia-peer-memory/1.0/build/make.log for more information.
modprobe: FATAL: Module nv_peer_mem not found in directory /lib/modules/4.10.0-37-generic

I checked and found that nv_peer_mem is actually in `/lib/modules/4.10.0-37-generic/`. The content of the log file `/var/lib/dkms/nvidia-peer-memory/1.0/build/make.log` is:

DKMS make.log for nvidia-peer-memory-1.0 for kernel 4.10.0-37-generic (x86_64)
Mon Oct 23 07:36:57 UTC 2017
/var/lib/dkms/nvidia-peer-memory/1.0/build/create_nv.symvers.sh 4.10.0-37-generic
-W- Could not get list of nvidia symbols.
Found /usr/src/nvidia-384-384.69/nvidia/nv-p2p.h
/bin/cp -f /usr/src/nvidia-384-384.69/nvidia/nv-p2p.h /var/lib/dkms/nvidia-peer-memory/1.0/build/nv-p2p.h
cp -rf /Module.symvers .
cp: cannot stat '/Module.symvers': No such file or directory
Makefile:48: recipe for target 'all' failed
make: *** [all] Error 1

Please help!

I installed the development version 1.0-5 on Ubuntu 16.04.
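
In case it helps, a few things that may be worth checking for this kind of failure (the paths below are guesses based on the make.log above, not verified): the build apparently could not obtain the NVIDIA driver's exported symbol list, so the driver should be loaded and its own DKMS build should have left a Module.symvers behind.

lsmod | grep nvidia                                   # is the NVIDIA driver actually loaded?
ls /usr/src/nvidia-384-384.69/nvidia/nv-p2p.h         # driver source the build found
find /var/lib/dkms/nvidia-384 -name Module.symvers    # symvers from the driver's own DKMS build (path is a guess)
grep nvidia_p2p_get_pages /proc/kallsyms              # are the p2p symbols exported by the running kernel?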

@everyone I solved this problem by installing MLNX_OFED 2.1 (http://www.mellanox.com/page/products_dyn?product_family=26). I don't really know what's going on, but it's in the prerequisites.
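
A quick way to check whether MLNX_OFED (rather than the inbox InfiniBand stack) is the one providing the peer-memory API; ofed_info ships with MLNX_OFED, and the grep assumes ib_core is loaded:

ofed_info -s                                          # MLNX_OFED version, if installed
grep ib_register_peer_memory_client /proc/kallsyms    # at the time of this thread, only the MLNX_OFED ib_core exported this symbol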

@experiencor Which OFED did you have before installing MLNX_OFED 2.1?

@feiga yup. It's MLNX_OFED 2.1.

@haggaie I don't quite understand what you mean; could you give more detail? I tried drossetti/nv_peer_memory@ea51a48, but it still failed.

@sj6077 We had an issue where linking against the NVIDIA driver failed in some cases. Eventually the solution was different from the one in the patch cited above: instead, we take the symbol versions from the NVIDIA driver that is currently installed (see e8d047e).
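
Roughly, the idea is something like the following (this is only an illustration, not the actual script from that commit; it assumes CONFIG_MODVERSIONS and an uncompressed nvidia.ko, and the awk pipeline and output format are simplified):

# build Module.symvers-style entries for the nvidia_p2p_* symbols from the CRCs
# embedded in the NVIDIA driver that is actually installed on this machine
NV_KO=$(modinfo -n nvidia)
nm "$NV_KO" | awk '$3 ~ /^__crc_nvidia_p2p_/ {
    name = $3; sub(/^__crc_/, "", name);
    printf "0x%s\t%s\tnvidia\tEXPORT_SYMBOL\n", $1, name
}' > nv.symvers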

Anyway, if you are using the latest version, perhaps you should report a new issue and explain what kind of errors you are seeing.