linux-rdma/perftest

Syndrom 0x51

farhanma opened this issue · 3 comments

When I ran perftest (both bidirectional and unidirectional) between two GPUs communicating over a PCIe link, I got the following error after a number of iterations:

Completion with error at client
  Failed status 4: wr_id 0 syndrom 0x51
scnt=828, ccnt=700
  Failed to complete run_iter_bw function successfully

With a little bit of Googling I found out that this error is somehow related to the ibv_post_send and ibv_poll_cq calls, which are libibverbs functions rather than Linux system calls. Has anyone encountered this error before? Thank you.
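
For anyone reading the output above: the number after "Failed status" is the work completion status reported by ibv_poll_cq, and it can be mapped to its libibverbs name. Below is a minimal sketch, assuming the line format shown in the error output and the `ibv_wc_status` enum ordering from `infiniband/verbs.h`. The helper name `decode_failure` is hypothetical and not part of perftest. The syndrome is a vendor-specific code, so the sketch only converts it to an integer and does not try to interpret it.

```python
import re

# ibv_wc_status values in the order they appear in infiniband/verbs.h.
WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    1: "IBV_WC_LOC_LEN_ERR",
    2: "IBV_WC_LOC_QP_OP_ERR",
    3: "IBV_WC_LOC_EEC_OP_ERR",
    4: "IBV_WC_LOC_PROT_ERR",
    5: "IBV_WC_WR_FLUSH_ERR",
    6: "IBV_WC_MW_BIND_ERR",
    7: "IBV_WC_BAD_RESP_ERR",
    8: "IBV_WC_LOC_ACCESS_ERR",
    9: "IBV_WC_REM_INV_REQ_ERR",
    10: "IBV_WC_REM_ACCESS_ERR",
    11: "IBV_WC_REM_OP_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
}

# Matches the perftest line "Failed status 4: wr_id 0 syndrom 0x51"
# (the spelling "syndrom" is what the tool prints).
FAILURE_RE = re.compile(
    r"Failed status (\d+): wr_id (\d+) syndrom (0x[0-9a-fA-F]+)"
)


def decode_failure(line):
    """Hypothetical helper: parse one perftest failure line into a dict.

    Returns None if the line is not a failure line.
    """
    match = FAILURE_RE.search(line)
    if match is None:
        return None
    status = int(match.group(1))
    return {
        "status": status,
        "status_name": WC_STATUS.get(status, "unknown"),
        "wr_id": int(match.group(2)),
        # Vendor-specific value; kept as a plain integer, not interpreted.
        "vendor_syndrome": int(match.group(3), 16),
    }


if __name__ == "__main__":
    print(decode_failure("  Failed status 4: wr_id 0 syndrom 0x51"))
```

With this mapping, status 4 corresponds to IBV_WC_LOC_PROT_ERR, a local protection error, which generally indicates that a buffer in the posted work request is not covered by a memory region valid for the requested operation. That suggests looking at how the GPU memory is registered, although the sketch itself cannot confirm the root cause.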

We also encountered this problem.

  1. Which GPU are you using?
  2. What exact commands did you use on both sides?
  3. Have you tried using 'use_cuda' on only one side?

Thanks

  1. NVIDIA A100-SXM4-80GB W HS
  2. Commands used on each side:
### master
./ib_write_bw -p <port_number> -a -b -F -d <ib_card_ip_address> --report_gbits -i 1 --use_cuda=<device_id>

### slave
./ib_write_bw -p <port_number> -a -b -F -d <ib_card_ip_address> --report_gbits -i 1 --use_cuda=<device_id> <master_ip_address>
  3. No, I have not. I can try that and update the GitHub issue.