FLAMEGPU/FLAMEGPU2

MPI Testing with multiple ranks but only 1 GPU


As of #1090, MPI-backed distributed ensembles will be implemented, with an MPI-only test suite that will only strictly test the use of MPI in a multi-GPU scenario.

However, Google Test is not MPI-aware, so this test suite has a number of limitations:

  1. Each MPI rank prints its test output by default, making results hard to interpret.
  2. The final result of the test suite reported by MPI (i.e. the exit code) will be that of rank 0, so if rank 0 passes but rank 1 fails, CI would report the run as a success.
  3. Telemetry is issued from each rank.
  4. Deadlocks might cause issues, but hopefully we won't hit those...
  5. Death tests are not possible with MPI (we should probably split our death tests out into CMake tests anyway, given they are not multithreading-safe and CUDA implicitly spawns a few threads).

There are a number of stale, out-of-date Google Test + MPI repos on GitHub we could investigate to resolve these issues, or we could roll some custom MPI handling in main.cu, which would address some (but not all) of them, and not deal with them very well without a lot of effort.

See:


Some quick and dirty improvements (each with significant downsides) could be:

  • In main, we could initialise MPI, get the world size and rank, and remove the Google Test listeners from ranks 1+, so that only rank 0 prints its status; but then we wouldn't know if other ranks failed, e.g.

    // ...
    MPI_Init(&argc, &argv);
    int world_rank = -1;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    if (world_rank != 0) {
        // Detach (and destroy) the default result printer on ranks 1+,
        // so only rank 0 reports test progress.
        ::testing::TestEventListeners& listeners = ::testing::UnitTest::GetInstance()->listeners();
        delete listeners.Release(listeners.default_result_printer());
    }
    auto rtn = RUN_ALL_TESTS();
    MPI_Finalize();
    // ...
  • In main, make all ranks communicate their test status back to rank 0, and return the collective success/failure. Getting logs out of failed ranks would be a lot of effort, however.
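The reduction in the bullet above can be sketched as pure logic (the names here are illustrative, not FLAME GPU API). With MPI this would be a single `MPI_Allreduce` with `MPI_MAX` over each rank's `RUN_ALL_TESTS()` return code, so the collective exit code is non-zero if any rank failed:

```cpp
#include <algorithm>
#include <vector>

// Collective exit code across ranks: 0 only if every rank returned 0.
// With MPI this is equivalent to:
//   MPI_Allreduce(&local_rtn, &global_rtn, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
int collectiveStatus(const std::vector<int> &per_rank_rtn) {
    int global_rtn = 0;
    for (int rtn : per_rank_rtn)
        global_rtn = std::max(global_rtn, rtn);
    return global_rtn;
}
```

Returning `collectiveStatus(...)` from main means mpirun's exit code reflects all ranks, not just rank 0.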

  • Use the collective status for telemetry, so it is only sent from one rank.
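A minimal sketch of that gating, assuming the collective status has already been reduced as above (`telemetryFor` is a hypothetical helper, not an existing FLAME GPU call): rank 0 alone decides to send, and the payload reports the reduced result rather than its own.

```cpp
// Hypothetical helper: decide whether this rank sends telemetry,
// and whether the payload should report the whole suite as passed.
struct TelemetryDecision {
    bool send;    // only rank 0 sends, so one payload per ensemble run
    bool passed;  // based on the reduced (collective) status, not rank 0's alone
};

TelemetryDecision telemetryFor(int world_rank, int global_rtn) {
    return { world_rank == 0, global_rtn == 0 };
}
```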