NOAA-GFDL/MOM6

Reduced performance of `PressureForce_FV` after 3D array promotion


PR #629 reduced the performance of the pressure force by about 5% after many of its arrays (`pa`, `intx_*`, etc.) were promoted from 2D to 3D. The test was the `benchmark` case on Gaea for a short 12-hour run.

Intel -g -O2:

[c5n1513:ocean_only/benchmark]$ for i in $(seq 5); do ../build/MOM6 2> /dev/null | grep "pressure force" ; done
(Ocean pressure force)                   8      1.126123
(Ocean pressure force)                   8      1.130228
(Ocean pressure force)                   8      1.137188
(Ocean pressure force)                   8      1.131168
(Ocean pressure force)                   8      1.137177

[c5n1513:ocean_only/benchmark]$ for i in $(seq 5); do ../pr629/MOM6 2> /dev/null | grep "pressure force" ; done
(Ocean pressure force)                   8      1.180546
(Ocean pressure force)                   8      1.185673
(Ocean pressure force)                   8      1.185738
(Ocean pressure force)                   8      1.184458
(Ocean pressure force)                   8      1.184249

GCC -g -O2:

[c5n1507:ocean_only/benchmark]$ for i in $(seq 5); do ../gcc_build/MOM6 2> /dev/null | grep "pressure force" ; done
(Ocean pressure force)                   8      1.445984
(Ocean pressure force)                   8      1.451641
(Ocean pressure force)                   8      1.456584
(Ocean pressure force)                   8      1.461087
(Ocean pressure force)                   8      1.450836

[c5n1507:ocean_only/benchmark]$ for i in $(seq 5); do ../gcc_pr629/MOM6 2> /dev/null | grep "pressure force" ; done
(Ocean pressure force)                   8      1.548566
(Ocean pressure force)                   8      1.539079
(Ocean pressure force)                   8      1.539607
(Ocean pressure force)                   8      1.540305
(Ocean pressure force)                   8      1.542846
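
As a rough check of the "about 5%" figure, the mean slowdown per compiler can be computed from the timings listed above. This is only a sketch: the numbers are copied from the runs above, while the dictionary layout and the helper name `slowdown_percent` are made up for illustration.

```python
from statistics import mean

# Pressure force timings copied from the runs above, five runs per build.
timings = {
    "Intel": {
        "baseline": [1.126123, 1.130228, 1.137188, 1.131168, 1.137177],
        "pr629": [1.180546, 1.185673, 1.185738, 1.184458, 1.184249],
    },
    "GCC": {
        "baseline": [1.445984, 1.451641, 1.456584, 1.461087, 1.450836],
        "pr629": [1.548566, 1.539079, 1.539607, 1.540305, 1.542846],
    },
}


def slowdown_percent(baseline, candidate):
    """Relative increase of the mean candidate time over the mean baseline time."""
    return 100.0 * (mean(candidate) / mean(baseline) - 1.0)


for compiler, runs in timings.items():
    pct = slowdown_percent(runs["baseline"], runs["pr629"])
    print(f"{compiler}: {pct:.1f}% slower")
```

With these numbers the slowdown comes out at roughly 4.6% for Intel and 6.1% for GCC. The run-to-run spread within each build is well under 1%, so the difference is not noise.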

Differences in timing were much larger on low-spec machines: my work laptop saw a 20% slowdown. The large penalty does not seem to appear when either the CPU has sufficient cache or the machine has sufficient RAM.
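
To get a feel for why cache size might matter, here is a back-of-envelope estimate of how much a single work array grows when it is promoted from 2D to 3D. The tile size and layer count below are hypothetical placeholders, not the actual benchmark configuration, and the helper `work_array_bytes` is invented for this sketch. It also assumes 8-byte reals.

```python
def work_array_bytes(ni, nj, nk=1, bytes_per_value=8):
    """Size in bytes of an ni x nj x nk array, assuming 8-byte reals by default."""
    return ni * nj * nk * bytes_per_value


# Hypothetical per-process tile size and layer count, for illustration only.
ni, nj, nk = 90, 90, 22

size_2d = work_array_bytes(ni, nj)      # one horizontal slab
size_3d = work_array_bytes(ni, nj, nk)  # the same slab for every layer

print(f"2D: {size_2d / 1024:.0f} KiB")
print(f"3D: {size_3d / 1024:.0f} KiB ({size_3d // size_2d}x larger)")
```

The 3D array is larger by a factor equal to the number of layers, so several such arrays together could plausibly overflow a small cache where the 2D versions fit. This is only a possible explanation and has not been confirmed by profiling.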

Profiling did not show any major differences in the generated code, so the slowdown is probably more related to the indexing and movement of memory. This is somewhat backed up by higher sampling times in the `movaps` instructions, but it is a bit early to attribute any particular cause. The main point of this issue is to document the problem, in case we come back to find a way to restore the old performance.