lammps/lammps

[BUG] Major performance regression in Special.build() component of replicate at large scale (512 nodes+)

hagertnl opened this issue · 14 comments

Summary

When replicating a molecular system (atom_style full, units real), Special.build() (in special.cpp) is called to build the 1-2, 1-3, 1-4 lists and perform tagging. At the end of the dedup and combine methods called during this process, atom->map_set() is called.

With the Kokkos backend, this is implemented in src/KOKKOS/atom_map_kokkos.cpp. It was recently refactored in commit 806af53, which resulted in a 25x slowdown at scale on GPUs compared to the stable_23Jun2022_update4 version of LAMMPS. With the Serial backend of Kokkos, performance actually improves between these two versions, so this looks like it could be a Kokkos performance bug. Non-Kokkos atom mapping also performs well at scale.

Using internal code timers and tracing libraries, we have identified the following chunk of code inside map_set as taking >95% of the replicate time at scale, and only on rank 0 (line 198):

  mapSorter.create_permute_vector(LMPDeviceType());
  mapSorter.sort(LMPDeviceType(), d_tag_sorted, 0, nall);
  mapSorter.sort(LMPDeviceType(), d_i_sorted, 0, nall);

All other ranks complete this section of code in less than 5 seconds, but rank 0 takes 300+ seconds. I suspect rank imbalance, but I haven't worked with Kokkos code enough to identify the exact issue.

Note that profiling libraries often flag this as an MPI issue in either Barrier or Allreduce, but that is a red herring: the other ranks are all waiting on rank 0, and the waiting happens to occur in MPI synchronization.
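
For reference, per-rank timings of this region can be collected with something like the sketch below. This is illustrative only (not the exact instrumentation we used); it assumes it sits inside the LAMMPS Kokkos code path where comm->me gives the MPI rank, and uses the stock Kokkos::Timer:

```cpp
// Sketch only: time the hot sort region in map_set on every rank.
// Kokkos::fence() ensures asynchronous device work is included in the measurement.
Kokkos::Timer timer;

mapSorter.create_permute_vector(LMPDeviceType());
mapSorter.sort(LMPDeviceType(), d_tag_sorted, 0, nall);
mapSorter.sort(LMPDeviceType(), d_i_sorted, 0, nall);

Kokkos::fence();
printf("[rank %d] map_set sort region: %.3f s\n", comm->me, timer.seconds());
```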

Performance data provided at the bottom of this ticket.

LAMMPS Version and Platform

The current develop branch was tested on OLCF's Frontier and Summit. The data at the bottom of this ticket is from Summit, but similar (and worse) scaling behavior has been observed on Frontier.

Expected Behavior

Performance should be neutral or improved going from stable_23Jun2022_update4 to current develop.

Actual Behavior

25x slow-down at scale.

Steps to Reproduce

w32k.data.gz
Data file attached (bulk water system with 98,304 atoms). Replicate it 4x4x4 to get a single-node-sized problem, 8x8x8 to get the 8-node problem, etc., since scaling is done in powers of 8. Replication is controlled by defining the nx, ny, nz variables at run time (see the example launch command after the input file). Input file:

variable        nx index 1
variable        ny index 1
variable        nz index 1

units           real
atom_style      full
pair_style      lj/cut/coul/long 9.0
bond_style      harmonic
angle_style     harmonic
kspace_style    pppm 1.e-5
kspace_modify   collective yes

read_data       w32k.data.gz
include         spce.ff.lmp

replicate       ${nx} ${ny} ${nz} bbox
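
For reference, a run like the ones in the performance table below can be launched by passing the replication factors on the command line. The launcher, binary name, and input file name here are placeholders, and the Kokkos switches correspond to the GPU-Kokkos case (6 ranks and 6 GPUs per Summit node); adjust them for your build and system:

```bash
# 8-node-equivalent problem: 8x8x8 replication, 6 MPI ranks per node on Summit.
# "lmp" and "in.w32k" are placeholder names; -k/-sf enable the KOKKOS package.
mpirun -np 48 ./lmp -in in.w32k \
    -var nx 8 -var ny 8 -var nz 8 \
    -k on g 6 -sf kk
```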

Further Information, Files, and Links

On Summit, we run 6 MPI ranks per node (1 V100 per rank), and the problem is weak-scaled with 6.29 million atoms per node (~1 million atoms per rank).
The table below gives the replicate time in seconds, with the time spent in the Special build in parentheses.

| Backend | Nodes | current develop | stable_23Jun2022_update4 |
| --- | --- | --- | --- |
| Default CPU-based | 1 | 1.042 (0.699) | 1.010 (0.653) |
| Default CPU-based | 8 | 1.269 (0.772) | 1.188 (0.708) |
| Default CPU-based | 512 | 1.444 (0.784) | 1.296 (0.730) |
| Default CPU-based | 4096 | 6.206 (4.037) | 10.091 (5.338) |
| Serial-Kokkos | 1 | 5.241 (4.681) | 13.093 (1.140) |
| Serial-Kokkos | 8 | 5.637 (4.958) | 13.170 (1.226) |
| Serial-Kokkos | 512 | 6.153 (5.401) | 19.515 (1.290) |
| Serial-Kokkos | 4096 | 13.12 (8.333) | 25.454 (1.970) |
| GPU-Kokkos | 1 | 4.798 (4.296) | 3.731 (1.145) |
| GPU-Kokkos | 8 | 5.078 (4.433) | 3.933 (1.213) |
| GPU-Kokkos | 512 | 16.194 (13.269) | 5.716 (1.903) |
| GPU-Kokkos | 4096 | 423.27 (331.1) | 16.378 (4.815) |

Here is a permalink to the code I referenced:

mapSorter.create_permute_vector(LMPDeviceType());
mapSorter.sort(LMPDeviceType(), d_tag_sorted, 0, nall);
mapSorter.sort(LMPDeviceType(), d_i_sorted, 0, nall);

I also did some debugging within this map_set method and found that the values of min and max are drastically different between rank 0 and every other rank: rank 0's min/max span 1 to the largest atom tag in the global system, while every other rank has a much narrower min/max range.
For example, in a run with 25 billion atoms spread across just 32 nodes x 56 ranks per node on Frontier (it fits in memory at this stage; I'm not endorsing this as a good idea), which is ~14 million atoms per rank, here are some reported min/max and nall values:

Rank 0: nall=14377536, nmax=15826944, min=1, max=25769799246
Rank 1: nall=14377536, nmax=15826944, min=1207960849, max=3618372585
Rank 2: nall=14377536, nmax=15826944, min=2818573585, max=5228985321

nall appears to determine the length of the map arrays being sorted, so I checked whether it differed on rank 0; it does not, but max - min on ranks 1 and 2 is roughly 10x smaller than on rank 0.
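
For what it's worth, per-rank min/max values like the ones above can be obtained with a simple device reduction over the local tag array. This is a sketch only; grabbing the tag view via atomKK->k_tag and the variable names are assumptions, and TAGINT_FORMAT is LAMMPS' printf format for tagint:

```cpp
// Sketch only: compute and print the per-rank min/max atom tag over the first
// nall entries of the device tag view (access via atomKK->k_tag is assumed).
auto d_tag = atomKK->k_tag.view<LMPDeviceType>();

tagint min_tag, max_tag;
Kokkos::parallel_reduce("map_set::min_tag", nall,
  KOKKOS_LAMBDA(const int i, tagint &lmin) {
    if (d_tag(i) < lmin) lmin = d_tag(i);
  }, Kokkos::Min<tagint>(min_tag));
Kokkos::parallel_reduce("map_set::max_tag", nall,
  KOKKOS_LAMBDA(const int i, tagint &lmax) {
    if (d_tag(i) > lmax) lmax = d_tag(i);
  }, Kokkos::Max<tagint>(max_tag));

printf("Rank %d: nall=%d, min=" TAGINT_FORMAT ", max=" TAGINT_FORMAT "\n",
       comm->me, nall, min_tag, max_tag);
```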

@hagertnl sorry to hear about the slowdown. Just to be sure, you are using a "hash" style atom map, correct? This looks like a bug; rank 0 should not be trying to sort 25 billion tags all by itself. I will take a look.

Hi @stanmoore1, no problem, I know this is an edge case relative to the most common LAMMPS usage. I typically let the atom map style default, and with the base system at 98k atoms it sounds like it probably defaults to array.

That said, I re-ran the 4096-node GPU-Kokkos problem that took 420 seconds with atom_modify map hash explicitly set after atom_style full, and it again took 420 seconds on the nose.

I also tried explicitly with atom_modify map array and got a crash:

(CudaInternal::singleton().cuda_device_synchronize_wrapper()) error( cudaErrorIllegalAddress): an illegal memory access was encountered ../../lib/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:154

> I also tried explicitly with atom_modify map array and got a crash:

This has to crash. You don't have enough RAM, and indexing with signed 32-bit integers will fail. LAMMPS defaults to map style array up to about 1 million atoms and then switches to map style hash, unless the user makes a different selection with the atom_modify command.
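
For completeness, the explicit selection that was tried above looks like this in the input script (atom_modify map must appear before read_data defines the box); if it is omitted, LAMMPS picks the style automatically based on system size as described:

```
atom_style      full
atom_modify     map hash     # or "map array"; omit to let LAMMPS choose
read_data       w32k.data.gz
```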

This issue would be properly fixed by kokkos/kokkos#6801, which bypasses Kokkos::BinSort, but that won't be released for a while. In the meantime, it may make sense to add an option to build the atom map on the CPU instead of on the GPU.

What would be the best way to do that? Last time I tried, I got errors saying that the Kokkos-enabled atom map must be used with the Kokkos pair styles and fixes, so I couldn't, for example, just turn the suffix off around atom_style. I may have been doing something wrong.

So this looks like a Kokkos performance bug rather than an issue in the LAMMPS source?

@hagertnl I think this is more a suboptimal use of Kokkos::BinSort, where it becomes inefficient on rank 0 because the huge difference between the min and max tags produces very wide bins. Using sort_by_key would not have the same problem because there are no bins. I will need to modify the code to add a new package option to build the map on the CPU.
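
To make the bin issue concrete, here is a rough sketch of how a 1D BinSort over the tag range is typically set up; the bin count, the sort_within_bins flag, and the names are illustrative, not the exact code in atom_map_kokkos.cpp. Because BinOp1D spreads a fixed number of bins uniformly over [min, max], a rank whose tags are clustered while its min/max span the full ~25-billion-tag range ends up with most atoms concentrated in a small fraction of very wide bins, and the per-bin work grows quickly with bin occupancy. A key-value sort, such as the one proposed in kokkos/kokkos#6801, orders the keys directly with no bins, so the tag range does not matter.

```cpp
#include <Kokkos_Sort.hpp>   // Kokkos::BinSort, Kokkos::BinOp1D

// Sketch only: 1D binned sort of atom tags over the range [min, max].
// A huge (max - min) on rank 0 makes every bin very wide, which is what
// degrades performance there.
using KeyViewType = Kokkos::View<tagint *, LMPDeviceType>;
using BinOp       = Kokkos::BinOp1D<KeyViewType>;

BinOp binner(nall, min, max);   // roughly one bin per atom, spread uniformly over [min, max]
Kokkos::BinSort<KeyViewType, BinOp> mapSorter(d_tag_sorted, 0, nall, binner,
                                              /* sort_within_bins (assumed) */ true);

mapSorter.create_permute_vector(LMPDeviceType());
mapSorter.sort(LMPDeviceType(), d_tag_sorted, 0, nall);
mapSorter.sort(LMPDeviceType(), d_i_sorted, 0, nall);
```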

Do you have an idea of the timeline for when that might be available to try out? We have an internal deadline approaching, so I might just need to revert to the older source from before these changes and cherry-pick the integer overflow fixes we need.

Thanks for all your help on this!

Or, is there a patch we could put in place that forces the atom map to be built on the CPU in the meantime, until a proper package option becomes available?

I will look at the code today.

@hagertnl just curious, when you use 32 nodes x 56 ranks per node, what is the MPI processor decomposition reported by LAMMPS? E.g., 2 by 2 by 2 MPI processor grid for 8 ranks.

@stanmoore1 8 by 14 by 16 MPI processor grid

I still don't understand why rank 0 has tags ranging from 1 to N with that decomposition at initialization, but I think the issue would be similar after running for a while, once atoms migrate around and a proc by chance owns atoms with both high and low tags, which seems to make Kokkos::BinSort run very inefficiently. There are some other Kokkos sorting options, but none support AMD/HIP yet (see kokkos/kokkos#6793). For Frontier, I think reverting to the host code is the best temporary workaround. I will try to submit a PR in a bit.

The decomposition at 4096 Summit nodes is a 24 by 32 by 32 MPI processor grid, if that's helpful to know.

I didn't understand that either. I was wondering if it had something to do with how the replicate worked out, where rank 0 owned the lowest-numbered atoms and also happened to own the highest-numbered ones. But I haven't looked into it that much.

For our current needs, just forcing all of that back to the CPU works for me.