hpcgarage/spatter

CPU performance of new/old Spatter varies

Closed this issue · 5 comments

We have noted that the Spatter refactor (#165) has inconsistent CPU performance for the serial and OpenMP backends.

With the addition of #165, we will have rough performance parity for Gather and MultiGather.

We need to ensure performance parity with v1.1 on CPU for the following kernels:

  • Scatter
  • MultiScatter
  • GatherScatter

To replicate:

  • Run cpu-ustream.json with both the new and old Spatter on the test platforms (CLX and SKL)

It turns out this issue was caused by compiling with CMAKE_BUILD_TYPE set to Debug instead of Release.
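One way to guard against benchmarking a Debug build again is to check the build type recorded in the CMake cache before running. The sketch below is only an illustration: the helper names `read_build_type` and `is_benchmark_ready` are hypothetical and not part of Spatter. It relies on the `KEY:TYPE=VALUE` entry format that CMake writes to `CMakeCache.txt`.

```python
import re


def read_build_type(cache_text):
    """Return the CMAKE_BUILD_TYPE value found in CMakeCache.txt text, or None."""
    # Cache entries look like KEY:TYPE=VALUE, e.g. CMAKE_BUILD_TYPE:STRING=Release
    match = re.search(
        r"^CMAKE_BUILD_TYPE:[A-Z]+=(.*)$", cache_text, flags=re.MULTILINE
    )
    return match.group(1).strip() if match else None


def is_benchmark_ready(cache_text):
    """True only when the cache records a Release build (hypothetical policy)."""
    return read_build_type(cache_text) == "Release"
```

To use it, read the cache file from the build directory (assumed here to be `build/CMakeCache.txt`) and pass its text to `is_benchmark_ready` before collecting timings. An empty or missing build type is treated as not ready.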

Performance on Skylake with 24 threads and input -pUNIFORM:8:1 -l$((2**24)), averaged over 10 runs:

Current:
Max: 92774.52, Mean: 88883.00, Stddev: 4764.36

Refactor:
Max: 92982.30, Mean: 84435.54, Stddev: 6467.54

The refactor does seem to have a consistently lower mean (about 5% lower in the numbers above) and a higher standard deviation across runs, though.
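For reference, the gap between the two builds can be quantified directly from the figures above. This is a small sketch using only the numbers reported in this comment. The helper names are hypothetical, and the coefficient of variation (stddev divided by mean) is simply one convenient way to compare run-to-run spread, not something Spatter itself reports.

```python
# Figures copied from the comment above (Skylake, 24 threads, 10 runs).
current = {"max": 92774.52, "mean": 88883.00, "stddev": 4764.36}
refactor = {"max": 92982.30, "mean": 84435.54, "stddev": 6467.54}


def relative_change(old, new):
    """Fractional change from old to new (negative means new is lower)."""
    return (new - old) / old


def coefficient_of_variation(stats):
    """Standard deviation expressed as a fraction of the mean."""
    return stats["stddev"] / stats["mean"]


mean_change = relative_change(current["mean"], refactor["mean"])
max_change = relative_change(current["max"], refactor["max"])

print(f"Mean change: {mean_change:+.1%}")
print(f"Max change:  {max_change:+.1%}")
print(f"CV current:  {coefficient_of_variation(current):.1%}")
print(f"CV refactor: {coefficient_of_variation(refactor):.1%}")
```

The peak numbers are nearly identical, so the difference shows up mainly in the mean and in the spread across runs.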

Once I have pushed my branch to the refactor branch, we can close this issue. I just need to make sure I have properly implemented the "multiple target buffers" feature for all of the kernels.

From #165: note that we need to check the Scatter, MultiScatter, and GatherScatter tests.

Hi @radelja, can you please test out Scatter? We think there may be some overhead and a slightly different instruction mix coming from compiler optimization.

The merge of #211 effectively closes this issue. We identified several small fixes, and @radelja evaluated some of the key kernels to confirm that functionality was similar.

More details on the analysis results will be available at https://github.com/hpcgarage/spatter/wiki/Spatter-2.0-Validation.

Closing; we will open bugfix issues as needed.