Consistent issues with the RingMPICUDA initialization
benson31 opened this issue
The RingMPICUDA ring is consistently failing on most platforms (I think I've seen issues on sierra and pascal). I have this commented out in the MPI-CUDA communicator in my local branches. Perhaps it would be better to lazily initialize it, especially since it's heap-allocated anyway. I'll update this issue if I do any of the debugging legwork to characterize the error more precisely. Superficially, a Sendrecv is called where at least one of the ranks is garbage, in the sense that it's often negative or many digits long, far outside any valid rank range.
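For illustration, a minimal sketch of what lazy initialization could look like; `MPICUDACommunicator` here is a stand-in wrapper, not Aluminum's actual class, and the `ring()` accessor is a hypothetical name:

```cpp
#include <memory>

// Stand-in for the real ring type; the actual constructor arguments are
// whatever Aluminum's RingMPICUDA takes.
struct RingMPICUDA {};

// Illustrative wrapper, not Aluminum's actual communicator class.
class MPICUDACommunicator {
public:
  // Construct the ring on first use rather than in the communicator's
  // constructor, so the broken setup path is only hit if the ring is used.
  RingMPICUDA& ring() {
    if (!ring_) {
      ring_ = std::make_unique<RingMPICUDA>();
    }
    return *ring_;
  }

private:
  std::unique_ptr<RingMPICUDA> ring_;  // heap-allocated, created lazily
};
```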
I get the error "Failed to determine the number of ranks per node", which comes from:
Aluminum/src/mpi_cuda/util.hpp
Line 143 in 0e6e585
Instead of relying on environment variables, it would be more robust to split communicators based on hashes of the node names, similar to the approach in LBANN.
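LBANN's exact implementation isn't reproduced here; the following is a minimal sketch of a node-name-hash split using only standard MPI calls, with `split_by_node_name` as a hypothetical helper name:

```cpp
#include <mpi.h>
#include <functional>
#include <string>

// Split comm into per-node subcommunicators by hashing the processor name.
// Hash collisions across distinct nodes are possible in principle; the
// shared-memory split shown in the next comment avoids that entirely.
MPI_Comm split_by_node_name(MPI_Comm comm) {
  char name[MPI_MAX_PROCESSOR_NAME];
  int len = 0;
  MPI_Get_processor_name(name, &len);
  // Fold the hash into the non-negative int range required for a color.
  int color = static_cast<int>(
      std::hash<std::string>{}(std::string(name, len)) % 0x7FFFFFFF);
  int rank = 0;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm node_comm;
  MPI_Comm_split(comm, color, rank, &node_comm);
  return node_comm;
}
```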
More robust still is to use the method that MPICommunicator already supports: splitting communicators based on shared-memory access.
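A minimal sketch of that approach, assuming the underlying mechanism is MPI-3's `MPI_Comm_split_type` with `MPI_COMM_TYPE_SHARED` (`split_by_shared_memory` is a hypothetical helper name, not MPICommunicator's API):

```cpp
#include <mpi.h>

// Group ranks that can share memory (i.e., ranks on the same node) using
// the standard MPI-3 split; no environment variables or hostnames needed.
MPI_Comm split_by_shared_memory(MPI_Comm comm) {
  int rank = 0;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm node_comm;
  MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, rank, MPI_INFO_NULL,
                      &node_comm);
  return node_comm;
}
```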
LBANN built with Hydrogen and MPI-CUDA Aluminum now works for me.
Closing since we have removed the custom ring allreduces.