LLNL/Aluminum

Consistent issues with the RingMPICUDA initialization

benson31 opened this issue · 5 comments

The RingMPICUDA ring is consistently failing on most platforms (I think I've seen issues on sierra and pascal). I have this commented out in the MPI-CUDA communicator in my local branches. Perhaps it would be better to initialize it lazily, especially since it's heap-allocated anyway. I'll update this issue if I do any of the debugging legwork to characterize the error more precisely. Superficially, a Sendrecv is called where at least one of the ranks is garbage: the value is often negative, or far longer than the 1-3 digits a valid rank would have.
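To illustrate the lazy-initialization idea, here is a minimal sketch; the class, member, and constructor here are hypothetical stand-ins, not Aluminum's actual types:

```cpp
#include <mpi.h>
#include <memory>

// Hypothetical stand-in for the ring allreduce state.
struct RingMPICUDA {
  explicit RingMPICUDA(MPI_Comm comm) { (void)comm; /* expensive setup that currently fails */ }
};

class MPICUDACommunicator {
 public:
  explicit MPICUDACommunicator(MPI_Comm comm) : comm_(comm) {}
  // Build the ring on first use rather than in the constructor, so
  // communicators that never call the ring allreduce never run the
  // failing setup code.
  RingMPICUDA& ring() {
    if (!ring_) ring_ = std::make_unique<RingMPICUDA>(comm_);
    return *ring_;
  }
 private:
  MPI_Comm comm_;
  std::unique_ptr<RingMPICUDA> ring_;  // heap-allocated anyway, per above
};
```

Deferring construction also means any failure surfaces at the first ring allreduce call, which is easier to tie to a specific code path than a crash during communicator setup.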

I get the error "Failed to determine the number of ranks per node", which comes from:

```cpp
std::cerr << "Failed to determine the number of ranks per node" << std::endl;
```
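For context, this kind of detection typically probes launcher-specific environment variables. The sketch below is an assumption about how it might work, not Aluminum's actual code; the variable names are real exports from MVAPICH2, Open MPI, and Slurm respectively, and the fallback error fires for any launcher the list doesn't know about:

```cpp
#include <cstdlib>
#include <initializer_list>
#include <iostream>

// Each launcher exports a different variable, so detection breaks
// whenever the job was started by a launcher this list doesn't cover.
int ranks_per_node_from_env() {
  for (const char* var : {"MV2_COMM_WORLD_LOCAL_SIZE",
                          "OMPI_COMM_WORLD_LOCAL_SIZE",
                          "SLURM_NTASKS_PER_NODE"}) {
    if (const char* val = std::getenv(var)) {
      return std::atoi(val);
    }
  }
  std::cerr << "Failed to determine the number of ranks per node"
            << std::endl;
  return -1;
}
```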

Instead of relying on environment variables, it would be more robust to split communicators based on hashes of the node names, similar to the approach in LBANN.
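A sketch of that idea (my own, not LBANN's actual code): hash each rank's processor name into a color for MPI_Comm_split, so ranks on the same node land in the same sub-communicator.

```cpp
#include <mpi.h>
#include <functional>
#include <string>

// Every rank hashes its processor name and uses the hash as the split
// color, grouping ranks by node without any environment variables.
MPI_Comm split_by_node_name(MPI_Comm comm) {
  char name[MPI_MAX_PROCESSOR_NAME];
  int len;
  MPI_Get_processor_name(name, &len);
  // Fold the hash into a non-negative int, as MPI_Comm_split requires.
  int color = static_cast<int>(
      std::hash<std::string>{}(std::string(name, len)) % 1000000);
  int rank;
  MPI_Comm_rank(comm, &rank);
  MPI_Comm node_comm;
  MPI_Comm_split(comm, color, rank, &node_comm);
  return node_comm;
}
```

One caveat: two distinct node names can hash to the same color, so a careful implementation would verify membership afterwards, e.g. by exchanging names within the resulting communicator.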

More robust still is to use the method the MPICommunicator already supports: splitting communicators based on shared-memory access.
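That's the standard MPI-3 MPI_Comm_split_type route; a generic sketch, not Aluminum's code:

```cpp
#include <mpi.h>

// Ranks that can share memory (i.e., ranks on the same node) end up in
// the same communicator, with no node names or environment variables
// involved. Local rank and ranks-per-node then fall out directly.
void get_local_info(int* local_rank, int* local_size) {
  MPI_Comm node_comm;
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, /*key=*/0,
                      MPI_INFO_NULL, &node_comm);
  MPI_Comm_rank(node_comm, local_rank);  // rank within the node
  MPI_Comm_size(node_comm, local_size);  // ranks per node
  MPI_Comm_free(&node_comm);
}
```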

The local rank/size detection problem is addressed in #7. What other problems have you seen?

LBANN built with Hydrogen and the MPI-CUDA Aluminum backend now works for me.

Closing since we have removed the custom ring allreduces.