NOAA-GFDL/MOM6

Excess pass_var calls in ZB2020

Closed this issue · 4 comments

The recently merged ZB2020 implementation is currently usable but appears to suffer from performance issues. The following changes have been suggested:

  • Halo updates applied to individual 2D layers could be deferred and applied once to the full 3D field.

  • There are instances of halo updates applied both before and after a computation. The halo width should account for the preceding computation so that only one update is required.

  • Many individual halo updates could be bundled into a do_group_pass (see the sketch below).

  • Expensive collective min_max tests for monotonicity may be better suited under a debug-like flag (either the global MOM debug flag or a ZB2020-specific flag).

  • CPU clocks around calls to ZB2020 would be useful for diagnosing future issues (also illustrated in the sketch below).

These are discussed in detail in #356.
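To make the last two items concrete, here is a minimal sketch of bundling several halo updates into one grouped pass and timing it with a CPU clock. It is not the merged ZB2020 code: the stress components Txx, Tyy, Txy and the clock label are illustrative assumptions, and only the create_group_pass/do_group_pass and cpu_clock interfaces from MOM6 are used.

! Sketch only: hypothetical ZB2020 halo update bundled into one group pass.
subroutine ZB2020_grouped_halo_sketch(G, GV, Txx, Tyy, Txy)
  use MOM_cpu_clock,    only : cpu_clock_id, cpu_clock_begin, cpu_clock_end, CLOCK_ROUTINE
  use MOM_domains,      only : create_group_pass, do_group_pass, group_pass_type
  use MOM_grid,         only : ocean_grid_type
  use MOM_verticalGrid, only : verticalGrid_type
  implicit none

  type(ocean_grid_type),   intent(inout) :: G   !< Horizontal grid, carrying the MOM domain
  type(verticalGrid_type), intent(in)    :: GV  !< Vertical grid
  real, dimension(G%isd:G%ied,G%jsd:G%jed,GV%ke), intent(inout) :: Txx !< Stress component (illustrative)
  real, dimension(G%isd:G%ied,G%jsd:G%jed,GV%ke), intent(inout) :: Tyy !< Stress component (illustrative)
  real, dimension(G%isd:G%ied,G%jsd:G%jed,GV%ke), intent(inout) :: Txy !< Stress component (illustrative)

  type(group_pass_type) :: pass_T     ! Handle describing the bundled exchange
  integer, save :: id_clock_pass = -1 ! CPU clock around the halo updates

  ! Registered on first call for brevity; in MOM6 this would normally live in the init routine.
  if (id_clock_pass < 0) &
    id_clock_pass = cpu_clock_id('(ZB2020 halo updates)', grain=CLOCK_ROUTINE)

  ! Register the full 3D fields once, then exchange them in a single grouped
  ! pass instead of issuing one pass_var call per field (or per 2D layer).
  call create_group_pass(pass_T, Txx, G%Domain)
  call create_group_pass(pass_T, Tyy, G%Domain)
  call create_group_pass(pass_T, Txy, G%Domain)

  call cpu_clock_begin(id_clock_pass)
  call do_group_pass(pass_T, G%Domain)
  call cpu_clock_end(id_clock_pass)

end subroutine ZB2020_grouped_halo_sketch

For a non-blocking variant, start_group_pass and complete_group_pass can replace do_group_pass so that interior computation overlaps the exchange.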

Hi @Hallberg-NOAA, @adcroft.

I am working on accelerating and refactoring my code. The current code is probably too complicated, and simplicity is preferable for acceleration.

During the implementation of the ZB model I made a rather unusual choice: I used the mask of outcropped points as a land mask when setting boundary conditions in the filters. As a result, even after many filter iterations the predicted subgrid stress does not spread into the outcropped regions. While this decision felt natural to me a year ago, today I am not sure it was a good choice, and it is difficult to justify on physical, mathematical, or numerical grounds. For comparison, the filters in GME use a simple land mask, and in most cases the rest of the MOM6 code uses no mask at all for stencil operations (for example, for interpolation). What do you both think: should I keep this feature, or should I use the GME filters instead?
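For illustration, here is a self-contained sketch (not the ZB2020 code) of one pass of a masked 5-point smoother; the point is only how the choice of mask enters the filter. Passing the static land mask gives GME-like behaviour, while passing a mask that also zeroes outcropped points gives the current behaviour, where the stress never spreads into outcropped regions no matter how many passes are taken.

! Sketch only: one pass of a mask-aware 5-point smoother.
subroutine masked_filter_pass(ni, nj, field, mask)
  implicit none
  integer, intent(in)    :: ni, nj                ! Interior extents; a halo of 1 is assumed valid
  real,    intent(inout) :: field(0:ni+1,0:nj+1)  ! Field being smoothed in place
  real,    intent(in)    :: mask(0:ni+1,0:nj+1)   ! 1. where points may contribute, 0. elsewhere

  real    :: tmp(0:ni+1,0:nj+1), wsum
  integer :: i, j

  tmp(:,:) = field(:,:)
  do j=1,nj ; do i=1,ni
    if (mask(i,j) > 0.) then
      ! Masked neighbours drop out of the average and the weights are
      ! renormalised, so no information leaks across the masked boundary.
      wsum = 4.*mask(i,j) + mask(i-1,j) + mask(i+1,j) + mask(i,j-1) + mask(i,j+1)
      field(i,j) = (4.*mask(i,j)*tmp(i,j) &
                    + mask(i-1,j)*tmp(i-1,j) + mask(i+1,j)*tmp(i+1,j) &
                    + mask(i,j-1)*tmp(i,j-1) + mask(i,j+1)*tmp(i,j+1)) / wsum
    endif
  enddo ; enddo
end subroutine masked_filter_pass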

There are additional complicated parts: high-order hyperviscosity and highly scale-selective filters. I used these features in the early stages of my research but no longer do, so for simplicity of accelerating the code they will be removed.

Removing unused features is a good strategy. If you ever need to recover the code, it will still be in the history.

Hi @adcroft, @marshallward.

In an experimental branch I have prepared a faster implementation of the code. It is 4 times faster on a single core than the previous code and orders of magnitude faster on many cores. An example of the runtime in NW2 at 1/2-degree resolution:

Tabulating mpp_clock statistics across   1000 PEs...
                                          tmin          tmax          tavg          tstd  tfrac grain pemin pemax
Total runtime                        23.794106     23.795215     23.794666      0.000196  1.000     0     0   999
...
Ocean dynamics                       21.111240     21.122164     21.116299      0.001721  0.887    11     0   999
...
(Ocean Zanna-Bolton-2020)             0.208538      0.424202      0.300429      0.035180  0.013    31     0   999
(ZB2020 compute stress)               0.011057      0.036033      0.017592      0.004430  0.001    41     0   999
(ZB2020 compute divergence)           0.019527      0.044737      0.027826      0.004378  0.001    41     0   999
(ZB2020 filter MPI exchanges)         0.066572      0.200020      0.127617      0.021220  0.005    41     0   999
(ZB2020 filter no MPI)                0.081012      0.109340      0.099712      0.005854  0.004    41     0   999

Optimization of the filters includes a marching halo with non-blocking grouped MPI exchanges and an implementation of the filter that minimizes the number of operations (a tensor product of 1D filters, which reduces the number of multiplications).
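As a rough illustration of the tensor-product idea (my own sketch assuming a simple (1,2,1)/4 stencil with a one-point halo, not the actual branch code): a 3x3 filter that factors into 1D stencils can be applied as an x sweep followed by a y sweep, which needs fewer multiplications per point than the full 9-point stencil.

! Sketch only: separable (1,2,1)/4 filter applied as two 1D sweeps.
subroutine filter_121_separable(ni, nj, field)
  implicit none
  integer, intent(in)    :: ni, nj                ! Interior extents; a halo of 1 is assumed valid
  real,    intent(inout) :: field(0:ni+1,0:nj+1)  ! Field to be smoothed in place

  real    :: tmp(0:ni+1,0:nj+1)
  integer :: i, j

  ! x-direction sweep, including the halo rows in j so that the
  ! following y sweep has valid neighbours.
  do j=0,nj+1 ; do i=1,ni
    tmp(i,j) = 0.25*(field(i-1,j) + 2.*field(i,j) + field(i+1,j))
  enddo ; enddo

  ! y-direction sweep over the interior points.
  do j=1,nj ; do i=1,ni
    field(i,j) = 0.25*(tmp(i,j-1) + 2.*tmp(i,j) + tmp(i,j+1))
  enddo ; enddo
end subroutine filter_121_separable

With a marching halo, a halo wide enough for several such passes is exchanged once, and the grouped non-blocking exchange lets interior sweeps proceed while the halo data are in flight.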

I am wondering whether we could try this branch for a couple of model days in the OM4 model; if we are happy with the performance, I will prepare a formal PR. The optimal parameter settings in NW2 are:

#override USE_ZB2020 = True
#override ZB_SCALING = 2.5
#override STRESS_SMOOTH_PASS = 4
#override ZB_KLOWER_R_DISS = 1.0
#override ZB_KLOWER_SHEAR = 1

The runtime measurements above work with clock_grain='ROUTINE'.
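For reference (an assumption about the standard FMS run configuration, not something taken from the branch), this granularity is selected with clock_grain in the fms_nml namelist of input.nml:

&fms_nml
  clock_grain = 'ROUTINE'  ! report timers down to the ROUTINE grain used above
/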

UPDATE: See the pull request for the newest code.

Fixed by #484