Mellanox/nv_peer_memory

Are there size limitations on registering GPU memory?

heaibao817 opened this issue · 3 comments

When I use this function:
ibv_reg_mr(rc_get_pd(), GPU_ADDR, SIZE, IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE | IBV_ACCESS_REMOTE_READ)

If SIZE is over 500 MB, it returns NULL.

I want to know whether this is caused by a limitation in nv_peer_memory or in the RDMA driver.

GPU: Tesla T4, Driver Version: 460.91.03, CUDA Version: 11.2
RDMA: mlx5_0
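
For anyone reproducing this, here is a minimal sketch of the registration path. It is only an illustration: it replaces the rc_get_pd() helper from the snippet above with an explicit ibv_alloc_pd on the first RDMA device, and the allocation size and device index are arbitrary.

/* Sketch: register GPU memory for RDMA. Assumes nv_peer_memory (or the
 * built-in nvidia_peermem) is loaded so ibv_reg_mr can pin device pointers. */
#include <stdio.h>
#include <infiniband/verbs.h>
#include <cuda_runtime.h>

int main(void)
{
    size_t size = 128UL * 1024 * 1024;   /* keep this below the GPU's BAR1 size */
    void *gpu_addr = NULL;

    if (cudaMalloc(&gpu_addr, size) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }
    struct ibv_context *ctx = ibv_open_device(devs[0]);  /* e.g. mlx5_0 */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    struct ibv_mr *mr = ibv_reg_mr(pd, gpu_addr, size,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    if (!mr)
        perror("ibv_reg_mr");   /* returns NULL (with errno set) when the mapping fails */
    else
        ibv_dereg_mr(mr);

    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    cudaFree(gpu_addr);
    return 0;
}

Build with something like gcc reg_gpu.c -libverbs -lcudart (plus the CUDA include/library paths for your install).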

Hi!
I'm investigating this too.
It looks like a limitation of either the nv_peer_memory module or the GPU driver: I can easily register more than 64 GB when using host memory and a Mellanox CX-5. For me it fails at around 200 MB.
I am using the built-in nvidia_peermem module though, for what it's worth.

According to https://docs.nvidia.com/cuda/gpudirect-rdma/index.html#display-bar-space

[...] It can be used to understand the application usage of BAR space, the primary resource consumed by GPUDirect RDMA mappings.

A certain amount of BAR space is reserved by the driver for internal use, so not all available memory may be usable via GPUDirect RDMA.

So we are limited to the BAR1 size, which you can get with nvidia-smi -q.
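
If it helps, the same figure can be read programmatically through NVML's nvmlDeviceGetBAR1MemoryInfo, which is roughly what nvidia-smi reports under "BAR1 Memory Usage". A minimal sketch, assuming device index 0 and linking with -lnvidia-ml:

/* Sketch: query BAR1 total/free via NVML. */
#include <stdio.h>
#include <nvml.h>

int main(void)
{
    nvmlDevice_t dev;
    nvmlBAR1Memory_t bar1;

    if (nvmlInit() != NVML_SUCCESS)
        return 1;
    if (nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS &&
        nvmlDeviceGetBAR1MemoryInfo(dev, &bar1) == NVML_SUCCESS)
        printf("BAR1 total: %llu MiB, free: %llu MiB\n",
               bar1.bar1Total >> 20, bar1.bar1Free >> 20);  /* values are in bytes */

    nvmlShutdown();
    return 0;
}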

Thank you, this has been solved.
The maximum size you can register is the BAR1 size; on the T4 that is 256 MB. When I switched to a V100 there was no such limitation.