lammps/lammps

[Feature Request] Implement MPI collective operations in KOKKOS implementation of KSPACE

hagertnl opened this issue · 2 comments

Creating this issue to track the implementation of `kspace_modify collective yes` in the Kokkos-enabled KSPACE package.
This is already implemented in the non-Kokkos version of KSPACE.

Detailed Description

The collective code exists in the regular (non-Kokkos) version here:
https://github.com/lammps/lammps/blob/develop/src/KSPACE/remap.cpp#L113-L203
which is missing from the Kokkos version: https://github.com/lammps/lammps/blob/develop/src/KOKKOS/remap_kokkos.cpp#L106

The Kokkos version needs to work both with and without GPU-aware MPI, as the point-to-point version already does.
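
For reference, the collective path in the non-Kokkos code packs the outgoing chunks into a contiguous buffer, issues a single MPI_Alltoallv on the plan's communicator, and unpacks the result. Below is a minimal standalone sketch of how that exchange could look on the Kokkos side, with the GPU-aware switch deciding whether device pointers go straight to MPI or are staged through host mirrors. The names here are illustrative only, not the actual `remap_plan_3d_kokkos` API.

```cpp
// Standalone sketch (not LAMMPS code) of the pattern the Kokkos collective
// remap needs: one MPI_Alltoallv over pre-packed buffers, with a runtime
// switch between GPU-aware MPI and host staging.  All names are illustrative.
#include <mpi.h>
#include <Kokkos_Core.hpp>

using FFT_SCALAR = double;   // LAMMPS selects float or double; assume double here

void alltoallv_remap_sketch(Kokkos::View<FFT_SCALAR *> d_send,   // packed send data (device)
                            Kokkos::View<FFT_SCALAR *> d_recv,   // packed recv data (device)
                            const int *sendcnts, const int *sdispls,
                            const int *recvcnts, const int *rdispls,
                            MPI_Comm comm, bool gpu_aware_mpi)
{
  if (gpu_aware_mpi) {
    // GPU-aware MPI: the library can read/write device memory directly
    MPI_Alltoallv(d_send.data(), sendcnts, sdispls, MPI_DOUBLE,
                  d_recv.data(), recvcnts, rdispls, MPI_DOUBLE, comm);
  } else {
    // Non-GPU-aware MPI: stage through host mirrors of the device buffers.
    // In a real plan these mirrors would be allocated once and reused, not
    // created on every remap (see the allocation discussion below).
    auto h_send = Kokkos::create_mirror_view(d_send);
    auto h_recv = Kokkos::create_mirror_view(d_recv);
    Kokkos::deep_copy(h_send, d_send);
    MPI_Alltoallv(h_send.data(), sendcnts, sdispls, MPI_DOUBLE,
                  h_recv.data(), recvcnts, rdispls, MPI_DOUBLE, comm);
    Kokkos::deep_copy(d_recv, h_recv);
  }
}
```

An actual implementation would reuse the plan's pack/unpack kernels around this exchange and choose the MPI datatype from FFT_SCALAR rather than hard-coding MPI_DOUBLE.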

Further Information, Files, and Links

@stanmoore1 I started working on this and had a quick question -- do you know why the send/recv buffers for the Alltoallv are allocated during the execution stage of the remap, while the buffers for the MPI_Send/MPI_Irecv path are allocated during the plan phase? The two allocations in question:

```cpp
// Alltoallv path: allocated in the execution stage
auto packedSendBuffer = (FFT_SCALAR *) malloc(sizeof(FFT_SCALAR) * sendBufferSize);

// point-to-point path: allocated in the plan phase
plan->sendbuf = (FFT_SCALAR *) malloc(size*sizeof(FFT_SCALAR));
```

I am considering moving the Alltoallv buffer allocation to the same place as the Send/Irecv allocations to avoid excessive memory allocation, since we do know the sizes ahead of time -- a rough sketch of what I have in mind is below. Thanks!
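
This sketch assumes the collective buffers become persistent Kokkos views on the plan; the struct and member names are hypothetical, not the actual `remap_plan_3d_kokkos` layout.

```cpp
// Sketch only: size and allocate the Alltoallv buffers once when the plan is
// created, mirroring how plan->sendbuf is malloc'ed for the point-to-point
// path, then reuse them on every remap.  Names are hypothetical.
#include <Kokkos_Core.hpp>

using FFT_SCALAR = double;   // assume double precision for the sketch

struct CollectiveBuffersSketch {
  Kokkos::View<FFT_SCALAR *> d_sendbuf;   // persistent, sized at plan time
  Kokkos::View<FFT_SCALAR *> d_recvbuf;
};

CollectiveBuffersSketch plan_phase_alloc(size_t send_size, size_t recv_size)
{
  CollectiveBuffersSketch bufs;
  bufs.d_sendbuf = Kokkos::View<FFT_SCALAR *>("remap:collective_sendbuf", send_size);
  bufs.d_recvbuf = Kokkos::View<FFT_SCALAR *>("remap:collective_recvbuf", recv_size);
  return bufs;
}

// The execution stage would then just pack into bufs.d_sendbuf, call
// MPI_Alltoallv, and unpack from bufs.d_recvbuf -- no malloc/free per remap.
```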

@hagertnl I think that would be a good optimization--I don't see a reason to allocate memory inside the execution stage.