LB: avoid double communication
RudolfWeeber opened this issue · 5 comments
In LB, a lot of data is sent around twice in the ghost communication:
Push scheme:
- reset force (requires and changes local forces)
- collide: requires local forces and pdfs
- pdf comm: communicates pdfs and forces
- boundary: requires pdfs and forces (for velocity calculation)
- streaming: requires pdfs, writes velocity field used by particle coupling
- full comm: communicates pdfs, forces, velocities
The final communication step is only there for particle coupling, which needs up-to-date velocities in the ghost layers. Forces could be removed from that communication. The velocities could also be removed, as that information can be re-calculated from the pdfs on the other side.
Alternatively, the pdfs could be removed, as the velocities are communicated anyway. The only client of pdfs in ghost layers outside LB is the get_density_at_pos() function. If this is needed, the ghost comm could be done lazily once it is called, or a density field could be added to the output of the streaming step.
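For orientation, here is a minimal sketch of the push-scheme control flow described above; all function names are placeholders standing in for the actual sweeps and communicators, declared here only so the sketch is self-contained:

```cpp
void reset_force_fields();         // requires and changes local forces
void perform_collide_sweep();      // requires local forces and pdfs
void ghost_communication_pdfs();   // currently also ships the forces
void apply_boundary_conditions();  // requires pdfs and forces
void perform_stream_sweep();       // writes the velocity field used by the coupling
void ghost_communication_full();   // pdfs + forces + velocities

void integrate_push_scheme() {
  reset_force_fields();
  perform_collide_sweep();
  ghost_communication_pdfs();
  apply_boundary_conditions();
  perform_stream_sweep();
  // Only the velocities are needed in the ghost layers by the particle
  // coupling, so this full exchange is the candidate for trimming.
  ghost_communication_full();
}
```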
Pull scheme:
Here, there is only one communication at the end, communicating pdfs, velocities, and forces.
The velocities could in principle be removed and re-calculated on the other side.
Right now, the final communication in the push and the pull scheme is done by the same function (ghost_communication_pdfs), although the schemes have different communication requirements.
I'd suggest the following:
- work on the push scheme only
- define a separate communicator which communicates only the velocity field (a minimal sketch follows this list)
- use that instead of the full communicator at the end of integrate_push() and check the improvement in performance and scalability
- if the improvement gained from that is relevant, get_interpolated_density() needs to be fixed or removed (it relies on the now missing ghost communication of the full pdfs). The easiest fix is to trigger a full ghost communication of the pdfs when someone queries an interpolated density. It's slow, but I'm not aware of a use case for that getter, anyway.
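A minimal sketch of such a velocity-only communicator, assuming the standard waLBerla building blocks; the field type, BlockDataID argument, and stencil choice are assumptions and not necessarily what the ESPResSo bridge uses:

```cpp
#include <memory>

#include "blockforest/StructuredBlockForest.h"
#include "blockforest/communication/UniformBufferedScheme.h"
#include "field/GhostLayerField.h"
#include "field/communication/PackInfo.h"
#include "stencil/D3Q27.h"

namespace wlb = walberla;
using VectorField = wlb::field::GhostLayerField<double, 3u>;
using VelocityComm =
    wlb::blockforest::communication::UniformBufferedScheme<wlb::stencil::D3Q27>;

// Build a ghost communication scheme that exchanges only the velocity field.
std::shared_ptr<VelocityComm>
make_velocity_communicator(std::shared_ptr<wlb::StructuredBlockForest> const &blocks,
                           wlb::BlockDataID velocity_field_id) {
  auto comm = std::make_shared<VelocityComm>(blocks);
  comm->addPackInfo(
      std::make_shared<wlb::field::communication::PackInfo<VectorField>>(
          velocity_field_id));
  return comm;
}
```

At the end of integrate_push(), `(*velocity_comm)()` would then replace the full pdf/force/velocity exchange.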
For a 48x48x48 LB grid, removing the pdfs and forces from the communicator improves runtime significantly:
- 25% faster on CPU with 8 cores
- 40% faster on GPU with 1 core or 8 cores
Similar observation with particle coupling enabled, using the defaults: 30x30x30 box with 125 particles per core.
I introduced a std::bitset where individual bits are addressed using an enum value to keep track of which fields have outdated halos. Python setter functions automatically do a full LB ghost communication after changing fields, so that getter functions always access correct data. But in C++ unit tests, one has to manually call the LB ghost communication after any call to a setter function or the integration sweep.
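A minimal sketch of that mechanism; the enum entries and class name are assumptions, not the actual identifiers:

```cpp
#include <bitset>
#include <cstddef>

// One bit per field: set when the halo (ghost layer) is outdated, cleared by
// the corresponding ghost communication. Setters and the integration sweep
// set bits; getters with consider_ghosts == true assert the bit is cleared.
enum class GhostField : std::size_t { pdf, velocity, last_applied_force, count };

class GhostLayerState {
  std::bitset<static_cast<std::size_t>(GhostField::count)> m_outdated;

public:
  void set_outdated(GhostField f) { m_outdated.set(static_cast<std::size_t>(f)); }
  void set_updated(GhostField f) { m_outdated.reset(static_cast<std::size_t>(f)); }
  bool is_outdated(GhostField f) const {
    return m_outdated.test(static_cast<std::size_t>(f));
  }
};
```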
Assertions were introduced to make sure getter functions never access outdated halo cells if the consider_ghosts flag is set to true. While the full testsuite passes with these assertions, more testing is needed, in particular to figure out whether the LB observables need to manually call the LB ghost update before collecting data. I'll also run more thorough benchmarks.
Concerning observables:
- if they only use information from the lattice sites (total momentum, etc.): probably not, as each MPI rank can read from its local LB cells and no information from ghost layers is needed
- velocity profile at arbitrary positions: probably not; it should use the same interpolated velocity getter as the particle coupling, which relies on the already ghost-communicated, pre-calculated velocity field
- density and momentum density profile at arbitrary positions: probably yes, due to the density, which we calculate directly from pdfs, which we no longer ghost communicate
Great that it worked out. This probably means we should put more effort into reducing the communication volume. The next steps would be:
In m_pdf_streaming_communicator, replace the PackInfo which packs the entire field with a generated pack info which only packs the data actually used on the other side by the streaming kernel (lbmpy_walberla.packinfo.generate_lb_pack_info does this, I think); see the sketch after the notes below.
Notes:
- according to its docstring, passing the collision kernel to the function as well will add the other fields the collision kernel needs to the pack info. This is the last_applied_force in our case. It should then be possible to remove the separate PackInfo for the last_applied_force from the m_pdf_streaming_communicator as well.
- I think we need streaming pattern "pull", in spite of this affecting integrate_push_scheme. Not sure.
- if using the generated pack info improves performance, the next step would be to have an AVX-vectorized version as well
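For illustration, the swap in the communicator setup would look roughly like this; the generated class name (pystencils::PackInfoPdf) and the member names are assumptions for the sketch, not the actual code:

```cpp
// before: generic PackInfos that pack every pdf entry and the whole force field
m_pdf_streaming_communicator->addPackInfo(
    std::make_shared<field::communication::PackInfo<PdfField>>(m_pdf_field_id));
m_pdf_streaming_communicator->addPackInfo(
    std::make_shared<field::communication::PackInfo<VectorField>>(
        m_last_applied_force_field_id));

// after: one generated pack info that only packs the populations that actually
// stream across the block boundary (plus, if the collision rule was passed to
// generate_lb_pack_info, the force entries the collision kernel reads)
m_pdf_streaming_communicator->addPackInfo(
    std::make_shared<pystencils::PackInfoPdf>(m_pdf_field_id));
```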
Here are the benchmark results:
We are now outperforming the ESPResSo 4.2 LB implementation up to and including 16 cores on the CPU. There is no AVX implementation of the PackInfo, since pystencils generates the exact same code with and without AVX. It's also unclear to me whether AVX would be beneficial here, since the data in the buffer doesn't have the same alignment as in the pdfs. Maybe for vector fields it would make sense, but that buffer is one order of magnitude smaller than the pdf buffer, and we would need to increase the buffer size by 33% to get the proper alignment, not to mention the changes to the field itself.
The linked PR raised new questions about MPI buffer data alignment and UBB interactions with the streaming-step communication. I invested two days in solving these issues, but could only come up with temporary workarounds. It is unclear whether it's worth pursuing this investigation further, as we already agreed on switching to the pull scheme in the near future.