NOAA-GFDL/SHiELD_build

Independent builds do not necessarily produce consistent results

Closed this issue · 15 comments

Is your question related to a problem? Please describe.

As part of a more involved development project, I am building and running SHiELD in a docker image using GNU compilers. A test I am running depends on the model consistently producing bitwise identical results for a given configuration. Puzzlingly, I am finding that the answers the model produces change depending on the build. Specifically, they seem to flip randomly between two states.

This repository minimally illustrates my setup. It contains a Dockerfile which is used to build SHiELD using the COMPILE script in this repository, submodules for the relevant SHiELD source code, and some infrastructure for testing the model within the docker image. The README should contain all the information necessary for reproducing the issue locally (at least on a system that supports docker). The upshot is that the regression tests, which check for bit-for-bit reproducibility across builds, do not always pass.

Describe what you have tried

I am a bit stumped at this point, so my idea here was to try and distill things to a minimal reproducible example, and reach out to see if there was something obvious I am doing wrong. Is there an issue in my environment or how I am configuring the build that is leading to this problem? I am happy to provide more information where needed. I appreciate your help!

Which tests are not reproducing? Is it some of the tests in this (NOAA-GFDL/SHiELD_build) repository? Or is it a test in the https://github.com/ai2cm/SHiELD-minimal/tree/main repository? I ask only because these tests in NOAA-GFDL/SHiELD_build/RTS/CI are known not to reproduce because of the add_noise namelist variable in fv_core_nml:
d96_2k.solo.bubble
d96_2k.solo.bubble.n0
d96_2k.solo.bubble.nhK

Thanks @laurenchilutti -- it is a custom test in the SHiELD-minimal repository that I set up. The namelist parameters are defined here; add_noise is not set, meaning that it takes on its default value of -1.

Thanks @lharris4, that's correct, the answers only have the potential to change when I recompile (and even then they seem to take on one of just two states). For a given executable, the model seems to produce consistent results (as evidenced by this test, which runs the executable 5 times and checks that it gets the same result).
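For anyone wanting to set up a similar check, here is a minimal sketch of a bitwise comparison across repeated runs. It is not the actual SHiELD-minimal test; the function names and directory layout are hypothetical, and it assumes the output files contain no embedded timestamps or other run-specific metadata that would change the bytes without changing the data.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def runs_are_bitwise_identical(run_dirs, filenames):
    """True if every named file has the same digest in all run directories.

    run_dirs and filenames are hypothetical inputs: one directory per run,
    and the output file names (e.g. restart files) expected in each.
    """
    for name in filenames:
        digests = {file_digest(Path(d) / name) for d in run_dirs}
        if len(digests) != 1:
            return False
    return True
```

The same function can be pointed at output directories from separate clean builds rather than repeated runs of one executable, which is the case that fails here.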

It is surprising that a random seed would change at compile time. Is this relevant only when using specific schemes? I.e. for testing purposes should I try running in a different configuration? For example, we do not seem to have this problem in FV3GFS.

I believe the only scheme that would use the random seed is the cloud overlap scheme, although some versions of the convection also use a random seed. I do know that the random seed was set up in a way to ensure run-to-run consistency/reproducibility.

You can try a 1-timestep test (run length the same as dt_atmos) and compare restart files to get an idea where precisely the reproducibility problem appears. As to why it would only change across recompiles and not re-runs, I am not sure.

I tried the single timestep test:

  • In the fv_core.res restart files, only T, u, and v differ. Maximum absolute differences at a grid point are 2.8e-13, 1.8e-15, and 7.1e-15 respectively.
  • In the fv_tracer.res restart files, only sphum, ice_wat, and cld_amt differ. Maximum absolute differences at a grid point are 6.9e-18, 2.7e-19, and 7.8e-14, respectively.
  • In the fv_srf_wnd.res restart files, no fields differ.
  • In the sfc_data restart files, only the tprcp field differs. Maximum absolute difference at a grid point is 6.8e-19.

Differences are extremely small and appear at only a limited number of grid points.
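The per-field comparison above can be sketched roughly as follows. This is an illustration rather than the script I actually ran: it assumes each restart file has already been loaded into a dict mapping field names to flat sequences of floats (the netCDF reading step is not shown), and the function name is hypothetical.

```python
def max_abs_diffs(fields_a, fields_b):
    """Map each differing field name to its largest pointwise |a - b|.

    Fields that are identical everywhere are omitted from the result.
    """
    diffs = {}
    for name, a in fields_a.items():
        b = fields_b[name]
        if len(a) != len(b):
            raise ValueError(f"size mismatch for field {name!r}")
        largest = max((abs(x - y) for x, y in zip(a, b)), default=0.0)
        if largest > 0.0:
            diffs[name] = largest
    return diffs
```

An empty result means the two files agree exactly for every field, as with fv_srf_wnd.res above.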

I am currently seeing if I can reproduce this behavior (inconsistent results between clean compiles) with Intel compilers on Gaea with the exact same test case.

Through four independent builds and runs with Intel compilers on Gaea, I am not able to reproduce this issue with this exact test case (i.e. I always get the same answer), which points to an issue in the interaction between my docker environment (which I think is fairly innocuous?), the GNU compilers, and SHiELD_build.

> Through four independent builds and runs with Intel compilers on Gaea, I am not able to reproduce this issue with this exact test case (i.e. I always get the same answer)

The same is true if I use GNU compilers on Gaea.

It appears that if I upgrade the GNU compilers in my container from version 11.4 to version 12.3 (more consistent with Gaea C5, which uses 12.2), I am able to obtain reproducible builds (at least through five consecutive clean builds); see ai2cm/SHiELD-minimal#4. I will keep this open until I see this confirmed in a few more build cycles, but this seems like a promising way forward.

Thanks for the update, Spencer. I know at one time MSD found what we believe was a compiler bug with gcc 11.1 when testing it as part of our FMS CI. We had subsequently tested v11.3 successfully, but have no data for v11.4.

Thanks @bensonr -- I have further traced this back to the fact that the version of MPICH available from the package manager in the Ubuntu LTS 22.04 image automatically uses link time optimization with its compiler wrappers. If I manually disable it by adding -fno-lto to the FFLAGS and CFLAGS, I get reproducible builds with version 11.4 GNU compilers; see ai2cm/SHiELD-minimal#6 for more details / context.
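One way to check whether a given compiler wrapper injects link time optimization is to look at the command line it expands to. To my knowledge the MPICH wrappers print this with -show (e.g. mpicc -show or mpifort -show). Below is a small sketch that scans such a command line for LTO-related options; the function name is hypothetical, and the list of prefixes it matches is an assumption that covers the GCC options I am aware of rather than an exhaustive set.

```python
import shlex


def lto_flags(command_line):
    """Return the tokens in a compiler command line that relate to LTO.

    Matches GCC-style options such as -flto, -flto=auto, -fno-lto,
    and -ffat-lto-objects (an assumed, non-exhaustive list).
    """
    prefixes = ("-flto", "-fno-lto", "-ffat-lto-objects")
    return [tok for tok in shlex.split(command_line) if tok.startswith(prefixes)]
```

A non-empty result for the wrapper's expanded command suggests that -fno-lto (or a different MPICH build) is worth trying if bit-for-bit reproducibility across builds matters.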

That could be old news to MSD folks, but I’m posting this here in case anyone else comes across this (the lesson seems to be: beware of link time optimization if reproducibility is important).

@spencerkclark - thanks for finding this and bringing it to our attention. I don't think we've seen this before, so I will make sure to alert the team about potential issues with GNU compiles.

Sounds good @bensonr. I am closing this issue, as I am confident now in the cause and how to work around it (currently just falling back to using an older Ubuntu LTS image, i.e. 20.04, in which MPICH does not add options related to link time optimization by default). The overriding flags are of course an option if using a newer Ubuntu LTS image.