Implement pair testing
rom1504 opened this issue · 1 comments
rom1504 commented
Testing N nodes together only tells us that at least one of them has an interconnect issue, not which one.
To find the node with the issue, pair testing is faster: for example, with 500 nodes, run 250 pair jobs and measure the speed of each. Every pair that is significantly slower than the rest goes to phase 2, with two options:
- retest each of its nodes with one of the working nodes
- randomly shuffle the pairings until we know what's working and what's not
This should identify which nodes run fast in a distributed training setting.
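
As a rough illustration of the two phases, here is a minimal Python sketch of the first phase 2 option (retesting with a known-good node). Everything in it is an assumption made for the example, not the code that was eventually merged: `measure_pair_speed` is a hypothetical callback that would launch a two-node job and return its throughput, `slow_ratio` is an arbitrary cutoff relative to the median pair speed, and the median is only a sensible baseline if most nodes are healthy.

```python
import random
from statistics import median
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]


def make_pairs(nodes: Sequence[str], rng: random.Random) -> List[Pair]:
    """Shuffle the nodes and group them two by two (an odd node is left out)."""
    shuffled = list(nodes)
    rng.shuffle(shuffled)
    return [(shuffled[i], shuffled[i + 1]) for i in range(0, len(shuffled) - 1, 2)]


def find_slow_nodes(
    nodes: Sequence[str],
    measure_pair_speed: Callable[[str, str], float],  # hypothetical benchmark hook
    slow_ratio: float = 0.5,  # assumed cutoff: slower than half the median speed
    seed: int = 0,
) -> List[str]:
    rng = random.Random(seed)

    # Phase 1: one job per pair, e.g. 250 jobs for 500 nodes.
    pairs = make_pairs(nodes, rng)
    speeds = {pair: measure_pair_speed(*pair) for pair in pairs}
    threshold = slow_ratio * median(speeds.values())

    good = [n for pair, s in speeds.items() if s >= threshold for n in pair]
    suspects = [n for pair, s in speeds.items() if s < threshold for n in pair]

    # With an odd node count, one node was never tested; treat it as a suspect.
    paired = {n for pair in pairs for n in pair}
    suspects += [n for n in nodes if n not in paired]

    if not good:
        raise RuntimeError("no fast pair found, cannot pick a reference node")

    # Phase 2: retest every suspect against a node from a fast pair.
    reference = good[0]
    return [n for n in suspects if measure_pair_speed(n, reference) < threshold]
```

In a real setup `measure_pair_speed` would submit a small distributed job on the two nodes and report something like bandwidth or samples per second; with a fake callback that returns a low speed whenever a known-bad node is involved, `find_slow_nodes` returns exactly those bad nodes.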
rom1504 commented
done