Wrapper for MPI-3 shared memory
lukashuebner opened this issue · 2 comments
The MPI-3 standard introduced functions that enable multiple ranks on the same node / NUMA domain to communicate via shared local memory.
- Can be used, for example, to share input data between ranks.
- Faster than RDMA put and get, even on a single node.
- Partitioning by NUMA domain is currently not standardized.
- The shared memory region may be mapped at a different virtual address in each MPI process, which comes with some caveats (and even more caveats).
Supporting this functionality in KaMPIng could be a unique selling point and very useful. There are probably multiple levels of support:
- Wrap the MPI calls to provide a `shmalloc()`.
- Implement a C++ allocator and an `offset_ptr`. These might not work with the STL, but possibly with the Boost containers created specifically for shared memory.
- Provide faster communication using (a) shared send/recv buffers with parallel serialization/deserialization and fewer messages per node, or (b) faster intra-node communication (MPI seems to have some problems with intra-node communication, according to early experiments done by @mschimek). All of this would, however, mainly be a way of avoiding MPI+OpenMP while still being able to claim hybrid parallelization.
For the sake of completeness: it seems as if one could also remap the shared memory region to another virtual address. On a 64-bit system, there might even be a large enough block of virtual addresses which is available on all ranks, and we'd thus be able to map the shared memory region there and use raw pointers again.
I think this is a very interesting proposal and I share Lukas's view that this could be a beneficial feature for KaMPIng.
(@lukashuebner you mean MPI+OpenMP, don't you?)
Yes, of course 🙊