Syndrom 0x51
farhanma opened this issue · 3 comments
When I ran the perftest bidirectional and unidirectional bandwidth tests between two GPUs communicating over a PCIe link, I got the following error after a certain number of iterations:
Completion with error at client
Failed status 4: wr_id 0 syndrom 0x51
scnt=828, ccnt=700
Failed to complete run_iter_bw function successfully
With a little bit of Googling, I found that this error is somehow related to ibv_post_send and ibv_poll_cq, the libibverbs functions (user-space verbs calls, not Linux system calls) written by the Mellanox folks. Has anyone encountered this error before? Thank you.
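For reference, the numeric status in the output above can be mapped back to a name. The sketch below is only an illustration, not part of perftest: it assumes the status number printed by perftest is the `ibv_wc_status` value from libibverbs' `<infiniband/verbs.h>`, in which 4 is `IBV_WC_LOC_PROT_ERR` (a local protection error, which generally means a buffer in the posted work request is not covered by a memory region valid for the operation). The syndrome 0x51 is vendor-specific and is not decoded here. Reading `scnt` and `ccnt` as posted versus completed work requests is also an assumption, and `decode_perftest_error` is a hypothetical helper name.

```python
import re

# Numeric values of enum ibv_wc_status from libibverbs' <infiniband/verbs.h>
# (first 14 entries only).
IBV_WC_STATUS = {
    0: "IBV_WC_SUCCESS",
    1: "IBV_WC_LOC_LEN_ERR",
    2: "IBV_WC_LOC_QP_OP_ERR",
    3: "IBV_WC_LOC_EEC_OP_ERR",
    4: "IBV_WC_LOC_PROT_ERR",
    5: "IBV_WC_WR_FLUSH_ERR",
    6: "IBV_WC_MW_BIND_ERR",
    7: "IBV_WC_BAD_RESP_ERR",
    8: "IBV_WC_LOC_ACCESS_ERR",
    9: "IBV_WC_REM_INV_REQ_ERR",
    10: "IBV_WC_REM_ACCESS_ERR",
    11: "IBV_WC_REM_OP_ERR",
    12: "IBV_WC_RETRY_EXC_ERR",
    13: "IBV_WC_RNR_RETRY_EXC_ERR",
}

# Patterns match the exact wording of the output quoted in this issue.
FAILED_RE = re.compile(r"Failed status (\d+): wr_id (\d+) syndrom (0x[0-9a-fA-F]+)")
COUNT_RE = re.compile(r"scnt=(\d+), ccnt=(\d+)")


def decode_perftest_error(log_text):
    """Hypothetical helper: extract status, syndrome and counters from the log."""
    failed = FAILED_RE.search(log_text)
    if failed is None:
        return None
    status = int(failed.group(1))
    result = {
        "status": status,
        "status_name": IBV_WC_STATUS.get(status, "unknown"),
        "wr_id": int(failed.group(2)),
        "vendor_syndrome": int(failed.group(3), 16),  # vendor-specific, left undecoded
    }
    counts = COUNT_RE.search(log_text)
    if counts is not None:
        scnt, ccnt = int(counts.group(1)), int(counts.group(2))
        # Assumption: scnt = work requests posted, ccnt = completions polled.
        result["outstanding"] = scnt - ccnt
    return result


log = """Completion with error at client
Failed status 4: wr_id 0 syndrom 0x51
scnt=828, ccnt=700
Failed to complete run_iter_bw function successfully"""

info = decode_perftest_error(log)
print(info["status_name"], hex(info["vendor_syndrome"]), info["outstanding"])
```

If that reading is right, the failure is reported on the local side of the transfer, so the registration of the local (GPU) buffer is a reasonable first place to look.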
We also encountered this problem.
- Which GPU are you using?
- What exact commands did you use on both sides?
- Have you tried using 'use_cuda' on only one side?
Thanks
- NVIDIA A100-SXM4-80GB W HS
### master
./ib_write_bw -p <port_number> -a -b -F -d <ib_card_ip_address> --report_gbits -i 1 --use_cuda=<device_id>
### slave
./ib_write_bw -p <port_number> -a -b -F -d <ib_card_ip_address> --report_gbits -i 1 --use_cuda=<device_id> <master_ip_address>
- No, I have not. I can try that and update the GitHub issue.
It may be related to the MMIO base setting in the system BIOS of the hypervisor (HV).
Please try this solution: https://www.dell.com/support/manuals/en-il/vmware-esxi-6.5.x/esxi6.5.x_rn_pub/virtual-machines-fail-to-power-on-when-system-bios-has-mmio-set-to-56-tb-with-supported-gpu-config?guid=guid-ab3ea7a8-b8ca-481a-b6e2-d83ab989dac5