rom1504/gpu-tester

Implement pair testing

rom1504 opened this issue · 1 comments

Testing N nodes together only tells us that at least one of them has an interconnect issue, not which one.

To find the node with the issue, it's faster to do pair testing: for example, with 500 nodes, run 250 two-node jobs and measure the speed of each. Every pair that is significantly slower than the rest goes to phase 2, with two options:

  • Retest each node from a slow pair with one of the working nodes
  • Randomly reshuffle the pairs until we know what's working and what's not

This should identify which nodes run fast in a distributed training setting.
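
The two phases could be sketched roughly as below, using the first phase 2 option (retest against a working node). This is only an illustration, not the gpu-tester implementation: `measure_pair` is a hypothetical callback that runs a two-node job and returns a speed (higher is faster), and the `slow_ratio` threshold relative to the median pair speed is an assumption that only holds when most pairs are healthy.

```python
import random
import statistics
from typing import Callable, List, Sequence


def find_slow_nodes(
    nodes: Sequence[str],
    measure_pair: Callable[[str, str], float],  # hypothetical: runs a 2-node job, returns speed
    slow_ratio: float = 0.8,  # assumption: "significantly slower" = below 80% of the median
    seed: int = 0,
) -> List[str]:
    """Return the nodes that stay slow even when paired with a known-good node."""
    shuffled = list(nodes)
    random.Random(seed).shuffle(shuffled)

    # Phase 1: split the nodes into disjoint pairs (N nodes -> N // 2 jobs).
    pairs = [(shuffled[i], shuffled[i + 1]) for i in range(0, len(shuffled) - 1, 2)]
    if not pairs:
        raise ValueError("need at least two nodes to form a pair")
    leftover = shuffled[-1] if len(shuffled) % 2 else None

    speeds = {pair: measure_pair(*pair) for pair in pairs}
    # Assumes most pairs are healthy, so the median is a usable baseline.
    threshold = slow_ratio * statistics.median(speeds.values())

    good = [n for pair, s in speeds.items() if s >= threshold for n in pair]
    suspects = [n for pair, s in speeds.items() if s < threshold for n in pair]
    if leftover is not None:
        suspects.append(leftover)  # never tested in phase 1, so test it in phase 2
    if not good:
        raise RuntimeError("no fast pair found; cannot pick a reference node")

    # Phase 2: retest each suspect with a node from a fast pair.
    # Cycling through the good nodes lets these jobs run in parallel in practice.
    bad = [
        node
        for i, node in enumerate(suspects)
        if measure_pair(node, good[i % len(good)]) < threshold
    ]
    return sorted(bad)
```

In this sketch the pair jobs run sequentially through the callback; on a real cluster each phase would be submitted as a batch of independent jobs.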

done