Mellanox/nv_peer_memory

create_nv.symvers.sh is broken

e-ago opened this issue

e-ago commented

This change (25774c3#diff-bdbe24543d2311a2bc6b64a3d102fc31L90) makes the script return the wrong symbol versions:

Getting symbol versions from /lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia.ko ...
0x000000004c9ba34e  nvidia_p2p_destroy_mapping  /lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia
0x0000000095397da4  nvidia_p2p_dma_map_pages    /lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia
0x0000000043c5682e  nvidia_p2p_dma_unmap_pages  /lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia
0x00000000adf40bc1  nvidia_p2p_free_dma_mapping /lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia
0x000000005868c8aa  nvidia_p2p_free_page_table  /lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia
0x000000001c254be6  nvidia_p2p_get_pages    /lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia
0x00000000d186c986  nvidia_p2p_get_rsync_registers  /lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia
0x00000000b73bde45  nvidia_p2p_init_mapping /lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia
0x000000007e399228  nvidia_p2p_put_pages    /lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia
0x0000000051395286  nvidia_p2p_put_rsync_registers  /lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia
0x000000005d649138  nvidia_p2p_register_rsync_driver    /lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia
0x000000009718676b  nvidia_p2p_unregister_rsync_driver  /lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia
0x0000000000009c80  nvidia_p2p_destroy_mapping  /lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia
0x0000000000009f20  nvidia_p2p_dma_map_pages    /lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia
0x0000000000009b60  nvidia_p2p_dma_unmap_pages  /lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia
0x0000000000009160  nvidia_p2p_free_dma_mapping /lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia
0x0000000000009c70  nvidia_p2p_free_page_table  /lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia
0x0000000000009470  nvidia_p2p_get_pages    /lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia
0x0000000000009930  nvidia_p2p_get_rsync_registers  /lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia
0x0000000000009df0  nvidia_p2p_init_mapping /lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia
0x0000000000009d20  nvidia_p2p_put_pages    /lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia
0x00000000000098c0  nvidia_p2p_put_rsync_registers  /lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia
0x0000000000009860  nvidia_p2p_register_rsync_driver    /lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia
0x0000000000009be0  nvidia_p2p_unregister_rsync_driver  /lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia

And the resulting Module.symvers:

$ cat Module.symvers  | egrep nvidia_p2p
0x00009160	nvidia_p2p_free_dma_mapping	/lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia	(unknown)
0x00009470	nvidia_p2p_get_pages	/lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia	(unknown)
0x00009860	nvidia_p2p_register_rsync_driver	/lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia	(unknown)
0x00009f20	nvidia_p2p_dma_map_pages	/lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia	(unknown)
0x00009df0	nvidia_p2p_init_mapping	/lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia	(unknown)
0x00009be0	nvidia_p2p_unregister_rsync_driver	/lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia	(unknown)
0x00009c80	nvidia_p2p_destroy_mapping	/lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia	(unknown)
0x00009d20	nvidia_p2p_put_pages	/lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia	(unknown)
0x00009930	nvidia_p2p_get_rsync_registers	/lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia	(unknown)
0x00009b60	nvidia_p2p_dma_unmap_pages	/lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia	(unknown)
0x00009c70	nvidia_p2p_free_page_table	/lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia	(unknown)
0x000098c0	nvidia_p2p_put_rsync_registers	/lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia	(unknown)

As a result, loading nv_peer_mem fails:

[2385461.944989] nv_peer_mem: disagrees about version of symbol nvidia_p2p_dma_unmap_pages
[2385461.944993] nv_peer_mem: Unknown symbol nvidia_p2p_dma_unmap_pages (err -22)
[2385461.945013] nv_peer_mem: disagrees about version of symbol nvidia_p2p_get_pages
[2385461.945014] nv_peer_mem: Unknown symbol nvidia_p2p_get_pages (err -22)
[2385461.945026] nv_peer_mem: disagrees about version of symbol nvidia_p2p_put_pages
[2385461.945028] nv_peer_mem: Unknown symbol nvidia_p2p_put_pages (err -22)
[2385461.945084] nv_peer_mem: disagrees about version of symbol nvidia_p2p_dma_map_pages
[2385461.945085] nv_peer_mem: Unknown symbol nvidia_p2p_dma_map_pages (err -22)
[2385461.945096] nv_peer_mem: disagrees about version of symbol nvidia_p2p_free_dma_mapping
[2385461.945097] nv_peer_mem: Unknown symbol nvidia_p2p_free_dma_mapping (err -22)
[2385461.945107] nv_peer_mem: disagrees about version of symbol nvidia_p2p_free_page_table
[2385461.945109] nv_peer_mem: Unknown symbol nvidia_p2p_free_page_table (err -22)
[2385489.780058] nv_peer_mem: disagrees about version of symbol nvidia_p2p_dma_unmap_pages
[2385489.780062] nv_peer_mem: Unknown symbol nvidia_p2p_dma_unmap_pages (err -22)
[2385489.780081] nv_peer_mem: disagrees about version of symbol nvidia_p2p_get_pages
[2385489.780082] nv_peer_mem: Unknown symbol nvidia_p2p_get_pages (err -22)
[2385489.780094] nv_peer_mem: disagrees about version of symbol nvidia_p2p_put_pages
[2385489.780096] nv_peer_mem: Unknown symbol nvidia_p2p_put_pages (err -22)
[2385489.780150] nv_peer_mem: disagrees about version of symbol nvidia_p2p_dma_map_pages
[2385489.780151] nv_peer_mem: Unknown symbol nvidia_p2p_dma_map_pages (err -22)
[2385489.780162] nv_peer_mem: disagrees about version of symbol nvidia_p2p_free_dma_mapping
[2385489.780164] nv_peer_mem: Unknown symbol nvidia_p2p_free_dma_mapping (err -22)
[2385489.780174] nv_peer_mem: disagrees about version of symbol nvidia_p2p_free_page_table
[2385489.780175] nv_peer_mem: Unknown symbol nvidia_p2p_free_page_table (err -22)

The fix is to revert the commit, going back to reading only the CRC symbols:

done < <(nm -o $nvidia_mod | grep "__crc_nvidia_p2p_")
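A minimal sketch of what "read only the CRC symbols" means when building the Module.symvers entries, written as a standalone POSIX sh function. The function name `nm_crcs_to_symvers`, the use of awk instead of the script's `while read` loop, and the `(unknown)` last column (copied from the Module.symvers output above) are assumptions; this is not the actual create_nv.symvers.sh code.

```sh
#!/bin/sh
# Hypothetical helper, NOT the real create_nv.symvers.sh code.
# stdin: nm output for nvidia.ko; $1: value for the module column.
# Emits one Module.symvers line per __crc_nvidia_p2p_* symbol.
nm_crcs_to_symvers() {
    awk -v mod="$1" '
        # Only the __crc_ symbols carry the symbol version (CRC);
        # "T nvidia_p2p_*" lines are code addresses and are ignored.
        $3 ~ /^__crc_nvidia_p2p_/ {
            crc = substr($1, length($1) - 7)   # low 32 bits = last 8 hex digits
            sym = substr($3, 7)                # drop the 6-char "__crc_" prefix
            printf "0x%s\t%s\t%s\t(unknown)\n", crc, sym, mod
        }'
}
```

For example: `nm "$nvidia_mod" | nm_crcs_to_symvers "${nvidia_mod%.ko}" >> Module.symvers`. Taking the last 8 characters of the first field also copes with `nm -o`, whose first field is `file:value`.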

I'm not sure I follow. What would the expected output of the script be in this case?

e-ago commented

The expected Module.symvers entries would be:

$ cat Module.symvers  | egrep nvidia_p2p
0xadf40bc1	nvidia_p2p_free_dma_mapping	/lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia	(unknown)
0x1c254be6	nvidia_p2p_get_pages	/lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia	(unknown)
0x5d649138	nvidia_p2p_register_rsync_driver	/lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia	(unknown)
0x95397da4	nvidia_p2p_dma_map_pages	/lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia	(unknown)
0xb73bde45	nvidia_p2p_init_mapping	/lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia	(unknown)
0x9718676b	nvidia_p2p_unregister_rsync_driver	/lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia	(unknown)
0x4c9ba34e	nvidia_p2p_destroy_mapping	/lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia	(unknown)
0x7e399228	nvidia_p2p_put_pages	/lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia	(unknown)
0xd186c986	nvidia_p2p_get_rsync_registers	/lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia	(unknown)
0x43c5682e	nvidia_p2p_dma_unmap_pages	/lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia	(unknown)
0x5868c8aa	nvidia_p2p_free_page_table	/lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia	(unknown)
0x51395286	nvidia_p2p_put_rsync_registers	/lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia	(unknown)
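One way to catch this kind of mismatch before loading nv_peer_mem is to compare every nvidia_p2p_* entry in Module.symvers with the `__crc_nvidia_p2p_*` value that nm reports for nvidia.ko. The helper below is hypothetical (not part of nv_peer_memory) and assumes the nm output has first been saved to a file.

```sh
#!/bin/sh
# Hypothetical check. $1: saved nm output of nvidia.ko, $2: Module.symvers.
# Prints each mismatching entry; exit status is 1 if any entry mismatches.
check_symvers_crcs() {
    awk '
        # First file (nm output): remember the low 32 bits of each CRC.
        NR == FNR {
            if ($3 ~ /^__crc_nvidia_p2p_/)
                crc[substr($3, 7)] = "0x" substr($1, length($1) - 7)
            next
        }
        # Second file (Module.symvers): column 1 must equal the nm CRC.
        $2 ~ /^nvidia_p2p_/ && crc[$2] != $1 {
            printf "mismatch: %s has %s, nm says %s\n", $2, $1, crc[$2]
            bad = 1
        }
        END { exit bad }
    ' "$1" "$2"
}
```

For example: `nm "$nvidia_mod" > nm.txt && check_symvers_crcs nm.txt Module.symvers`. Against the broken Module.symvers above, every entry would be reported, since each holds a code address instead of a CRC.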

Sorry, I don't follow. Where do the different addresses come from? And why are those 32-bit addresses?

e-ago commented

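@tzafrir-mellanox Sorry, let me rephrase. With the latest commit, the script no longer uses the CRC checksums of the symbols exported by nvidia.ko. The CRCs are the __crc_nvidia_p2p_* entries below; the values now written to Module.symvers come from the T nvidia_p2p_* entries, which are code addresses (for example, nvidia_p2p_free_dma_mapping gets 0x00009160 instead of 0xadf40bc1). The CRCs are 32-bit values, which is why the expected output keeps only the low 8 hex digits of what nm prints.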

$ nm -a /lib/modules/4.4.0-141-generic/kernel/drivers/video/nvidia.ko | egrep p2p
000000004c9ba34e A __crc_nvidia_p2p_destroy_mapping
0000000095397da4 A __crc_nvidia_p2p_dma_map_pages
0000000043c5682e A __crc_nvidia_p2p_dma_unmap_pages
00000000adf40bc1 A __crc_nvidia_p2p_free_dma_mapping
000000005868c8aa A __crc_nvidia_p2p_free_page_table
000000001c254be6 A __crc_nvidia_p2p_get_pages
00000000d186c986 A __crc_nvidia_p2p_get_rsync_registers
00000000b73bde45 A __crc_nvidia_p2p_init_mapping
000000007e399228 A __crc_nvidia_p2p_put_pages
0000000051395286 A __crc_nvidia_p2p_put_rsync_registers
000000005d649138 A __crc_nvidia_p2p_register_rsync_driver
000000009718676b A __crc_nvidia_p2p_unregister_rsync_driver
...
0000000000009c80 T nvidia_p2p_destroy_mapping
0000000000009f20 T nvidia_p2p_dma_map_pages
0000000000009b60 T nvidia_p2p_dma_unmap_pages
0000000000009160 T nvidia_p2p_free_dma_mapping
0000000000009c70 T nvidia_p2p_free_page_table
0000000000009470 T nvidia_p2p_get_pages
0000000000009930 T nvidia_p2p_get_rsync_registers
0000000000009df0 T nvidia_p2p_init_mapping
00000000000002d8 D nvidia_p2p_page_cache_name
00000000000004c0 r nvidia_p2p_page_size_mappings
0000000000000850 B nvidia_p2p_page_t_cache
0000000000009d20 T nvidia_p2p_put_pages
00000000000098c0 T nvidia_p2p_put_rsync_registers
0000000000009860 T nvidia_p2p_register_rsync_driver
0000000000009be0 T nvidia_p2p_unregister_rsync_driver

Disclaimer: I don't know much about this code, but I do know how to write scripts.

On Ubuntu 18.04 and Ubuntu 19.10, the new version of the script is needed. On RHEL 7.4 and Ubuntu 16.04, it breaks.

Specifically, the following change fixes the issue on the "older" systems:

-modules_pat="crc_nvidia_p2p|T nvidia_p2p"
+modules_pat="_crc_nvidia_p2p"
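Assuming modules_pat is used as an extended regular expression (as the `|` suggests), the difference between the two patterns shows up on two lines taken from the nm listing above: the new pattern also matches the code-address line, so each nvidia_p2p_* symbol is picked up twice, once with its CRC and once with its address.

```sh
#!/bin/sh
# Two nm lines for the same exported function (from the nm listing above).
sample='000000004c9ba34e A __crc_nvidia_p2p_destroy_mapping
0000000000009c80 T nvidia_p2p_destroy_mapping'

# Pattern after the change: matches the CRC line AND the code-address line.
new_matches=$(printf '%s\n' "$sample" | grep -cE 'crc_nvidia_p2p|T nvidia_p2p')
# Pattern before the change: matches only the CRC line.
old_matches=$(printf '%s\n' "$sample" | grep -cE '_crc_nvidia_p2p')

echo "new pattern matches $new_matches lines, old pattern matches $old_matches"
```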

Therefore the problem does seem to come from the newly added "T nvidia_p2p" matches, not from some other bug in the script.

So I can fix the script to use the non-CRC symbols only when there are no CRC symbols. That should work on both kinds of systems, though I really don't understand why.
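The proposed approach (prefer the CRC symbols, fall back to the T symbols only when nvidia.ko exports no CRCs) can be sketched as a small POSIX sh filter over nm output. The function name and its stdin interface are hypothetical; this illustrates the idea and is not the code merged in #60.

```sh
#!/bin/sh
# Hypothetical filter. stdin: nm output for nvidia.ko.
# Prints the __crc_nvidia_p2p_* lines if any exist, otherwise the
# "T nvidia_p2p_*" (code address) lines.
select_p2p_symbols() {
    nm_out=$(cat)
    if printf '%s\n' "$nm_out" | grep -q '__crc_nvidia_p2p_'; then
        printf '%s\n' "$nm_out" | grep '__crc_nvidia_p2p_'
    else
        printf '%s\n' "$nm_out" | grep 'T nvidia_p2p_'
    fi
}
```

Choosing one source or the other, rather than matching both with a single pattern, avoids writing two entries for the same symbol.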

A fix along those lines: #60

Fixed by #60. Closing.