Optimizations
- Get benchmarking scaffold up.
- LJ eps sweep for TIP3P waters via __ballot check - 10x speedup. NB list deferred until we do proteins.
- ES (electrostatics) - Hilbert curve re-ordering and re-sorting. Optimize erfc. Write a custom kernel for force-only calculations.
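To make the eps-sweep idea concrete, here is a rough CPU-side emulation in Python of a warp ballot used to skip tiles with no LJ work. It is a sketch under stated assumptions, not the kernel's actual logic: it assumes standard TIP3P (only oxygen has a nonzero epsilon), a 32-wide all-pairs tiling, and an ordering that groups oxygens together. The names `ballot` and `count_lj_tiles` are hypothetical, and the epsilon value is an arbitrary placeholder.

```python
WARP_SIZE = 32


def ballot(predicates):
    """Emulate a warp ballot: bit i is set when lane i's predicate is true."""
    mask = 0
    for lane, pred in enumerate(predicates):
        if pred:
            mask |= 1 << lane
    return mask


def count_lj_tiles(eps):
    """Return (total_tiles, tiles_with_work) for an all-pairs 32x32 tiling.

    A tile has LJ work only if both its row block and its column block
    contain at least one atom with nonzero eps (ballot mask != 0).
    """
    blocks = [eps[i:i + WARP_SIZE] for i in range(0, len(eps), WARP_SIZE)]
    has_lj = [ballot(e != 0.0 for e in block) != 0 for block in blocks]
    total = len(blocks) ** 2
    active = sum(has_lj) ** 2
    return total, active


N_WATERS = 999  # 2997 particles / 3 atoms per water
EPS_O = 1.0     # placeholder value, only "nonzero" matters here

# Assumed ordering A: O, H, H interleaved, so every block sees an oxygen.
eps_interleaved = [EPS_O, 0.0, 0.0] * N_WATERS
# Assumed ordering B: all oxygens first, then all hydrogens.
eps_grouped = [EPS_O] * N_WATERS + [0.0] * (2 * N_WATERS)

total_a, active_a = count_lj_tiles(eps_interleaved)
total_b, active_b = count_lj_tiles(eps_grouped)
print(f"interleaved: {active_a}/{total_a} tiles have LJ work")
print(f"grouped:     {active_b}/{total_b} tiles have LJ work")
```

Under these assumptions the ballot check only pays off once the zero-eps atoms are grouped, since an interleaved ordering puts an oxygen in every block.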
Starting benchmark for 2997 water particles:
551.53us 519.45us 570.04us void k_electrostatics<float>(int, double const *, double const *, double const *, double, int const *, int const *, double, double, double const *, double const *, __int64*, double*, double*, double*)
449.27us 448.86us 449.79us void k_lennard_jones_inference<float>(int, double const *, double const *, double const *, double, int const *, int const *, double, double const *, double const *, __int64*, double*, double*, double*)
New LennardJones:
89.374us 89.374us 89.374us void k_lennard_jones_inference<float>(int, double const *, double const *, double const *, double, int const *, int const *, double, double const *, double const *, int const *, __int64*, double*, double*, double*)
We went from ~449us to ~89us! That is 20% of the original time, a 5x speedup. Electrostatics is next, and it will be a lot trickier.
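A quick check of the speedup arithmetic, using the average kernel times from the two profiler lines above (the variable names are just for illustration):

```python
# Average kernel times in microseconds, copied from the profiler output.
lj_before_us = 449.27
lj_after_us = 89.374

speedup = lj_before_us / lj_after_us
fraction_of_original = lj_after_us / lj_before_us

print(f"speedup: {speedup:.2f}x")
print(f"new kernel takes {fraction_of_original:.1%} of the original time")
```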
total ixns: 3539677/40170496 => implied per tile density of 0.088
total empty tiles: 17045/39229 => UGH
We need to achieve an implied tile density of around 0.50 to be competitive with OpenMM.
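For reference, the density numbers can be reproduced as below. This assumes each tile is 32x32 = 1024 pair slots, which is consistent with 39229 tiles giving 40170496 slots. The density over non-empty tiles is an extra derived figure, not one quoted above.

```python
TILE_SLOTS = 32 * 32  # pair slots per tile, assuming 32x32 tiles

total_ixns = 3539677
total_slots = 40170496
total_tiles = 39229
empty_tiles = 17045

# Sanity check that the slot count really is tiles * 1024.
assert total_tiles * TILE_SLOTS == total_slots

density = total_ixns / total_slots
empty_fraction = empty_tiles / total_tiles
# Density if the empty tiles were never enumerated at all.
nonempty_density = total_ixns / ((total_tiles - empty_tiles) * TILE_SLOTS)

print(f"implied tile density:      {density:.3f}")
print(f"fraction of empty tiles:   {empty_fraction:.3f}")
print(f"density of non-empty tiles: {nonempty_density:.3f}")
```

Even with every empty tile dropped, the remaining tiles would still sit well below the 0.50 target, so pruning empty tiles alone would not close the gap.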
Sigh, time for the bane of my existence: findInteractingBlocks.cu. Good thing I wrote the docstring for it 8 years ago.
Some brief notes on nblist ixns, mainly for myself:
For a randomly chosen block of Hilbert-ordered atoms, we compare the atoms in the block against every other atom. This is the distribution of interactions:
[plot: distribution of interactions]
In total, 511 atoms interact with at least 1 atom in the block, which equates to 16 tiles of work (511/32, rounded up). If we take the cumsum of the above:
[plot: cumulative sum of interactions]
By the time we get to 78% (400/511) of the atoms, we would've only covered about half of all the interactions.
In fact, 198 atoms have fewer than 6 interactions with the block of 32. If we process them separately (like we do for exclusions), we can reduce this to 511-198=313 atoms, or 10 tiles of work (313/32, rounded up).
The low work list will be processed differently.
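The tile counts above, and the idea of splitting atoms into a dense list and a low-work list, can be sketched as follows. The helpers `tiles_needed` and `split_by_work` are hypothetical, and the `counts` list is toy data, not the real distribution. The threshold of 6 is the one quoted above.

```python
import math

TILE_WIDTH = 32  # atoms per tile column, matching the block of 32


def tiles_needed(num_atoms, tile_width=TILE_WIDTH):
    """Number of tiles needed to cover num_atoms column atoms."""
    return math.ceil(num_atoms / tile_width)


def split_by_work(counts, threshold=6):
    """Split atom indices into (dense, sparse) lists.

    counts[i] is how many atoms in the block atom i interacts with.
    Atoms with zero interactions are dropped entirely.
    """
    dense = [i for i, c in enumerate(counts) if c >= threshold]
    sparse = [i for i, c in enumerate(counts) if 0 < c < threshold]
    return dense, sparse


interacting_atoms = 511
low_work_atoms = 198

tiles_before = tiles_needed(interacting_atoms)
tiles_after = tiles_needed(interacting_atoms - low_work_atoms)
print(f"tiles before split: {tiles_before}, after split: {tiles_after}")

# Toy per-atom interaction counts, purely illustrative.
counts = [0, 1, 5, 6, 32, 3, 12]
dense, sparse = split_by_work(counts)
print(f"dense atoms: {dense}, low-work atoms: {sparse}")
```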
Done