Wrapper for MPI-3 shared memory
lukashuebner opened this issue · 2 comments
The MPI-3 standard introduced functions that enable multiple ranks on the same node / NUMA domain to communicate via shared local memory.
- Can be used, for example, to share input data between ranks.
- Faster than RDMA put and get, even on a single node.
- Partitioning by NUMA domain is currently not standardized.
- The shared memory region may be mapped at a different virtual address in each MPI process, which comes with some caveats (and even more caveats).
Supporting this functionality in KaMPIng could be a unique selling point and very useful. There are probably multiple levels of support:
- Wrap the MPI calls to provide a `shmalloc()`.
- Implement a C++ allocator and an `offset_ptr`. These might not work with the STL, but possibly with the Boost containers created specifically for shared memory.
- Provide faster communication using (a) shared send/recv buffers with parallel serialization/deserialization and fewer messages per node, or (b) faster intra-node communication (MPI seems to have some problems with intra-node communication, according to early experiments done by @mschimek). All of this would, however, mainly be a way of avoiding MPI+OpenMP while still being able to claim hybrid parallelization.
For the sake of completeness: it seems as if one could also remap the shared memory region to another virtual address. On a 64-bit system, there might even be a large enough block of virtual addresses which is available on all ranks, and we'd thus be able to map the shared memory region there and use raw pointers again.
I think this is a very interesting proposal and I share Lukas's view that this could be a beneficial feature for KaMPIng.
(@lukashuebner you mean MPI+OpenMP, don't you?)
Yes, of course 🙊