Mellanox/nv_peer_memory

Segmentation fault while transmitting with a GPU address

heaibao817 opened this issue · 1 comments

The GDB backtrace is:
#0 0x00007ffff6d16cb4 in __memcpy_ssse3_back () from /lib64/libc.so.6
#1 0x00007ffc805e7b16 in copy_to_scat (scat=0x7ff9bc18f6e0, buf=buf@entry=0x7ff9bc1894c0, size=size@entry=0x7ffa167fe2ec,
max=max@entry=1, ctx=ctx@entry=0x1c1e8780) at ../providers/mlx5/qp.c:88
#2 0x00007ffc805e7e07 in copy_to_scat (ctx=0x1c1e8780, max=1, size=0x7ffa167fe2ec, buf=0x7ff9bc1894c0, scat=)
at ../providers/mlx5/qp.c:78
#3 mlx5_copy_to_send_wqe (qp=qp@entry=0x7ff9bc18a230, idx=, buf=0x7ff9bc1894c0, size=)
at ../providers/mlx5/qp.c:161
#4 0x00007ffc805e51a4 in mlx5_parse_cqe (lazy=0, cqe_ver=1, wc=0x7ffa167fe5a0, cur_srq=,
cur_rsc=, cqe=, cqe64=, cq=) at ../providers/mlx5/cq.c:743
#5 mlx5_poll_one (cqe_ver=1, wc=0x7ffa167fe5a0, cur_srq=, cur_rsc=, cq=)
at ../providers/mlx5/cq.c:904
#6 poll_cq (cqe_ver=1, wc=, ne=, ibcq=0x7ff9bc188d40) at ../providers/mlx5/cq.c:932
#7 mlx5_poll_cq_v1 (ibcq=0x7ff9bc188d40, ne=32, wc=) at ../providers/mlx5/cq.c:1306
#8 0x00007ffce1248ab2 in ibv_poll_cq (wc=0x7ffa167fe5a0, num_entries=32, cq=)
/include/infiniband/verbs.h:2456

The crash appears to happen inside ibv_poll_cq. When I switch to a CPU address, the problem does not occur.
What is going wrong here?
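
For context, frames #0 to #3 suggest that the completion path copies data into the posted buffer with an ordinary CPU memcpy. Below is a minimal sketch of that kind of copy, written only to illustrate the idea. It is not the rdma-core source: `struct scatter_entry`, `copy_to_scatter_list`, and `scatter_demo` are hypothetical names, and the demo uses host buffers only. If the destination were a GPU address that the CPU cannot dereference, the memcpy would presumably fault, which would be consistent with frame #0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified scatter entry: a raw address plus a length.
 * This is an illustration, not the real mlx5 data segment layout. */
struct scatter_entry {
    uint64_t addr;        /* destination address as posted by the application */
    uint32_t byte_count;  /* capacity of this entry in bytes */
};

/*
 * Copy up to *size bytes from buf into the scatter list using the CPU.
 * On return, *size holds the number of bytes that did not fit.
 * Returns 0 when everything was copied, -1 if the list was too small.
 */
int copy_to_scatter_list(const struct scatter_entry *scat, int max,
                         const void *buf, size_t *size)
{
    const unsigned char *src = buf;

    assert(scat != NULL && buf != NULL && size != NULL);

    for (int i = 0; i < max && *size > 0; ++i) {
        size_t copy = *size < scat[i].byte_count ? *size : scat[i].byte_count;

        /* The CPU dereferences the destination here. A pointer that is
         * only valid on the device is not usable in this call. */
        memcpy((void *)(uintptr_t)scat[i].addr, src, copy);

        src += copy;
        *size -= copy;
    }
    return *size == 0 ? 0 : -1;
}

/* Host-memory demo: returns 0 if the payload landed in both buffers. */
int scatter_demo(void)
{
    char a[4] = {0};
    char b[8] = {0};
    const char payload[] = "GPUDirect";  /* 9 characters plus the NUL byte */
    struct scatter_entry scat[2] = {
        { (uint64_t)(uintptr_t)a, sizeof a },
        { (uint64_t)(uintptr_t)b, sizeof b },
    };
    size_t remaining = sizeof payload;

    if (copy_to_scatter_list(scat, 2, payload, &remaining) != 0)
        return -1;

    /* First 4 bytes go to a, the remaining 6 (including NUL) go to b. */
    return (memcmp(a, "GPUD", 4) == 0 && memcmp(b, "irect", 6) == 0) ? 0 : -1;
}
```

With host buffers, `scatter_demo()` succeeds. The sketch only shows why a CPU-side copy needs a CPU-accessible destination; whether this is the actual cause of the crash above is an assumption.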