benchmark --verify error
Opened this issue · 2 comments
Hi!
I'm testing the benchmark program. When I use the --verify flag, I am getting some complaints.
what(): [enforce fail at /home/ubuntu/gloo/gloo/benchmark/main.cc:91] T(offset + expected) == input[i]. 2.4e+07 vs 2.4e+07. Mismatch at index: 375000
terminate called after throwing an instance of 'gloo::EnforceNotMet'
The command I used is:
benchmark -s ${totalClients} -r ${idx} -h xxx.xxx.xxx.xxx -p 6379 -t tcp --sync true --inputs 1 --elements 100\ 0000 --iteration-count 1 --verify allreduce_ring_chunked
I ran this across 8 machines, so ${totalClients}=8, and ${idx} range from 0-7.
Did I do something obviously wrong?
This is running on Ubuntu, and has the latest master checked out.
Thanks!
Hi there! Thanks for reporting the issue. I think this is expected given the number of elements you're using in your test in combination with the number of machines. The verification code fills all input buffers with sequential values, offset by the collective rank, and strided by the collective size. I think that at index 375000 we run into numerical mismatches for floating point numbers. To fix this we should only use integer math in verification mode.
Thank you! :)