Thread safety

Question

Thread safety

ndryden opened this issue 4 years ago · 2 comments

Aluminum currently does not provide a documented thread-safety guarantee. In practice, different backends implicitly provide different safety guarantees. I believe the main bottleneck is with the progress engine, which currently uses a single-producer/single-consumer queue.

We should:

Provide an explicit, documented standard for thread safety. I propose that, for performance reasons, we allow this to be chosen at compile-time to be either:
- Multiple threads can simultaneously submit communication operations. If the operations are on the same communicator or compute stream, the ordering must be provided by the user.
- Only a single thread submits communication at a time. (We can allow this to be different threads/etc.-- it is up to the user to synchronize.)
Implement the above. I believe the main task is to improve the queues the progress engine uses and ensure memory pools are thread-safe.

Things that are not necessarily thread-safe:

Internal CUDA stuff. (See, e.g., NCCLRequest).
Memory pools (depending on setting).
Progress engine.

Answer 1 · 2022-04-25T21:50:52.000Z

Current testing indicates we are now thread-safe. 🎉

A number of PRs have contributed to this:

#124 (support for serializing all MPI calls to one thread)
#125 (thread-safety for CUDA resources)
#126 (new memory pools)
#127 (AL_THREAD_MULTIPLE)
#128 (testing with multiple threads)
Bugfixes: #130, #135

I will note for posterity that we were previously seeing rare errors with multiple threads and the host-transfer backend. I am not currently seeing those.

Remaining todo is to update the docs.

Answer 2 · 2022-04-26T20:40:27.000Z

With #148 merged, I'm declaring this complete.

Any specific problems can be separate issues.