proteneer/timemachine

Optimizations


  1. Get benchmarking scaffold up.
  2. LJ eps sweep for TIP3P waters via __ballot check - 10x speedup. NB list deferred until we do proteins.
  3. ES - hilbert curve re-ordering and resorting. Optimize erfc. Customize a kernel for only force calculations.

Starting benchmark for 2997 water particles:

551.53us  519.45us  570.04us  void k_electrostatics<float>(int, double const *, double const *, double const *, double, int const *, int const *, double, double, double const *, double const *, __int64*, double*, double*, double*)
449.27us  448.86us  449.79us  void k_lennard_jones_inference<float>(int, double const *, double const *, double const *, double, int const *, int const *, double, double const *, double const *, __int64*, double*, double*, double*)

New LennardJones:

89.374us  89.374us  89.374us  void k_lennard_jones_inference<float>(int, double const *, double const *, double const *, double, int const *, int const *, double, double const *, double const *, int const *, __int64*, double*, double*, double*)

We went from ~449us to ~89us! That is 20% of the original time, a 5x speedup. Electrostatics is next, and it will be a lot trickier.

total ixns: 3539677/40170496 => implied per tile density of 0.08
total empty tiles: 17045/39229 => UGH

We need to achieve an implied tile density of around 0.50 to be competitive with OpenMM.
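For reference, the density figure above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the function names are made up for illustration, and the 32 x 32 tile size is inferred from the numbers in the log (39229 tiles x 1024 pairs = 40170496), not taken from the kernel source.

```python
TILE_SIZE = 32  # assumed: 32 x 32 atom pairs per tile, inferred from the log


def tile_density(total_ixns, total_tiles, tile_size=TILE_SIZE):
    """Fraction of atom pairs in the visited tiles that actually interact."""
    return total_ixns / (total_tiles * tile_size * tile_size)


def empty_fraction(empty_tiles, total_tiles):
    """Fraction of visited tiles that contain no interacting pair at all."""
    return empty_tiles / total_tiles


# Numbers copied from the log output above.
density = tile_density(3539677, 39229)
empty = empty_fraction(17045, 39229)
print(f"implied tile density: {density:.3f}")
print(f"empty tile fraction:  {empty:.2f}")
```

So roughly 43% of the tiles do no useful work at all, and the target of 0.50 is more than five times the current density.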

Sigh, time for the bane of my existence - findInteractingBlocks.cu - good thing I wrote the docstring for it 8 years ago.

Some brief notes on nblist ixns, mainly for myself:

For a randomly chosen block of Hilbert-ordered atoms, we compare the atoms in the block against every atom. This is the distribution of interactions:

[image: distribution of interactions with the block]

There are a total of 511 atoms that interact with at least 1 atom in the block, which equates to 16 tiles of work. If we do the cumsum of the above:

[image: cumulative sum of the interaction distribution]

By the time we get to 78% (400/511) of the atoms, we would have covered only about half of all the interactions.
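A rough sketch of how a coverage curve like this could be computed. The helper names are hypothetical, the counts are synthetic placeholders (not the real per-atom data), and visiting atoms from fewest to most interactions is an assumption about the ordering.

```python
from bisect import bisect_left
from itertools import accumulate


def coverage_curve(counts):
    """Cumulative fraction of interactions covered, visiting atoms from
    fewest to most interactions (the ordering is an assumption)."""
    ordered = sorted(counts)
    total = sum(ordered)
    return [c / total for c in accumulate(ordered)]


def atoms_to_cover(counts, target):
    """Number of atoms (in that order) needed to reach `target` coverage."""
    curve = coverage_curve(counts)
    return bisect_left(curve, target) + 1


# Synthetic example: many low-work atoms and a few high-work atoms.
counts = [1] * 6 + [10] * 2
needed = atoms_to_cover(counts, 0.5)
print(f"{needed}/{len(counts)} atoms needed to cover half the interactions")
```

In this toy case most of the atoms have to be visited before half of the interactions are covered, which is the same skew the real distribution shows.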

In fact, there are 198 atoms that have fewer than 6 interactions with the block of 32. If we process them separately (like we do for exclusions), we can reduce this to 511-198=313 atoms, or 10 tiles of work.
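The tile counts are just a ceiling division by the block size. Below is a minimal sketch of the split, assuming a per-atom list of interaction counts against the block; `split_by_work` and `tiles_needed` are illustrative names, not functions from the repo.

```python
import math

TILE_SIZE = 32          # atoms per tile column, matching the block of 32
LOW_WORK_THRESHOLD = 6  # atoms with fewer interactions go to the low work list


def split_by_work(counts, threshold=LOW_WORK_THRESHOLD):
    """Partition atom indices into a dense list and a low work list.

    Atoms with zero interactions are dropped from both lists.
    """
    dense = [i for i, c in enumerate(counts) if c >= threshold]
    low_work = [i for i, c in enumerate(counts) if 0 < c < threshold]
    return dense, low_work


def tiles_needed(n_atoms, tile_size=TILE_SIZE):
    """Number of tiles required to hold n_atoms."""
    return math.ceil(n_atoms / tile_size)


print(tiles_needed(511))        # 16 tiles before the split
print(tiles_needed(511 - 198))  # 10 tiles after removing the low work atoms
```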

The low work list will be processed differently.

wip in #264

Done