[BUG]: frac_grid input in non-frac mode is not restart reproducible in coupled model at c96,c192 or c384
Closed this issue · 45 comments
Describe the bug
-
When using the new frac_grid input in non-frac mode, FV3 does not restart repro for up to 8 fields.
-
When using the original FV3_input_data384, FV3 does restart repro.
To Reproduce
Checkout this branch: https://github.com/DeniseWorthen/ufs-s2s-model/tree/feature/restart
-
This branch has restart tests at c96mx100, c192mx050 and c384mx025 added, each using a 1d run, a 12h run and then a restart from the 12h run.
-
The nems.configure has a mediator history write phase added at each timestep.
-
The rt.conf has only the 3 restart tests for each resolution active.
-
When any one resolution is run, first the 1d and 12h runs complete. The restart test starts after the 12h run completes, using the restart files produced in that run. Comparisons should then be made between the just completed 1d run and the restart run (not the baseline!).
-
For any given resolution, compare the mediator history file at the first timestep of the restart run against the same timestep of the continuous run, e.g.:
cprnc -m s2s_control_c384/RESTART/ufs.s2s.cpl.hi.2016-10-03-43650.nc s2s_restart_c384/RESTART/ufs.s2s.cpl.hi.2016-10-03-43650.nc |grep RMS
produces:
57: RMS atmImp_Faxa_rain 1.2255E-09 NORMALIZED 3.5022E-05
73: RMS atmImp_Faxa_snow 8.4771E-11 NORMALIZED 3.2261E-05
82: RMS atmImp_Faxa_swndf 6.0959E-02 NORMALIZED 1.7366E-03
98: RMS atmImp_Faxa_swvdf 1.3133E-01 NORMALIZED 2.6185E-03
128: RMS atmImp_Sa_pbot 5.4215E-06 NORMALIZED 5.5112E-11
144: RMS atmImp_Sa_shum 3.6325E-10 NORMALIZED 3.9001E-08
153: RMS atmImp_Sa_tbot 1.7967E-05 NORMALIZED 6.2125E-08
176: RMS atmImp_Sa_z 1.4080E-06 NORMALIZED 6.1957E-08
These are the 8 fields which are not the same the first time FV3 sends fields to the mediator after restart.
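As a minimal sketch of how the cprnc output above could be summarized, the following Python parses the `RMS` lines and lists the fields that differ. The function names and the `tol` parameter are hypothetical, and the optional leading `NN:` prefix is handled only because it appears in the output pasted above; the sample text is copied from this report.

```python
import re

# Matches lines such as "57: RMS atmImp_Faxa_rain 1.2255E-09 NORMALIZED 3.5022E-05".
# The leading "NN:" prefix is treated as optional (assumption based on the pasted output).
RMS_LINE = re.compile(
    r"^\s*(?:\d+:\s*)?RMS\s+(\S+)\s+([0-9.Ee+-]+)\s+NORMALIZED\s+([0-9.Ee+-]+)"
)

def parse_rms(text):
    """Return {field: (rms, normalized_rms)} for every RMS line found in text."""
    diffs = {}
    for line in text.splitlines():
        m = RMS_LINE.match(line)
        if m:
            diffs[m.group(1)] = (float(m.group(2)), float(m.group(3)))
    return diffs

def differing_fields(diffs, tol=0.0):
    """Sorted field names whose RMS difference exceeds tol (hypothetical helper)."""
    return sorted(name for name, (rms, _) in diffs.items() if rms > tol)

# Sample copied from the cprnc output in this report.
SAMPLE = """\
57: RMS atmImp_Faxa_rain 1.2255E-09 NORMALIZED 3.5022E-05
73: RMS atmImp_Faxa_snow 8.4771E-11 NORMALIZED 3.2261E-05
82: RMS atmImp_Faxa_swndf 6.0959E-02 NORMALIZED 1.7366E-03
98: RMS atmImp_Faxa_swvdf 1.3133E-01 NORMALIZED 2.6185E-03
128: RMS atmImp_Sa_pbot 5.4215E-06 NORMALIZED 5.5112E-11
144: RMS atmImp_Sa_shum 3.6325E-10 NORMALIZED 3.9001E-08
153: RMS atmImp_Sa_tbot 1.7967E-05 NORMALIZED 6.2125E-08
176: RMS atmImp_Sa_z 1.4080E-06 NORMALIZED 6.1957E-08
"""

if __name__ == "__main__":
    for name in differing_fields(parse_rms(SAMPLE)):
        print(name)
```

In practice the text would come from the cprnc command shown above rather than from an embedded string.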
-
Since the mediator history files contain the atm fields on the mesh, an additional step is required to put them onto a tile grid to view. This ncl script can be used to transfer the mesh fields to the tile for any resolution. The user must point to their own run directory, called RT in this script. The tool will produce files such as ufs.s2s.cpl.hi.43650.tile3.nc, which can be compared between the 1d and restart runs.
-
To show that the original FV3_input_data384 does reproduce, uncomment the line #export FRAC_GRID_INPUT='.F.' in the s2s c384 tests (in tests/tests).
- A copy of cprnc is located here: /scratch1/NCEPDEV/stmp2/Denise.Worthen/cprnc
Expected behavior
Additional context
To clarify:
The existing restart test in ufs-s2s uses c96mx025 (the current default resolution). Originally when we committed CMEPS we wanted to add the restart test for c384mx025 but found an issue on lake_frac points (see Issues ufs-community/ufs-s2s-model#34 and ufs-community/ufs-s2s-model#108). So we switched the restart test to c96mx025.
I have not tested the c96mx025 resolution in this branch since we are not carrying it forward as a tested resolution.
Testing this branch with C384mx025 resolution using the original input data in FV3_input_data384 does reproduce. It does not reproduce in my testing using the c384mx025 input in FV3_input_frac.
I've made this assignment to myself for tracking but I will require assistance from @shansun6
Hi Denise,
I am able to reproduce at C384mx025 resolution using the original input data, but fail this test when running with frac input. Since it takes a while to run C384, I am starting a test using C96 and with FV3 only without other submodules, as I think the same problem would occur in the FV3 standalone model. How do you like this approach?
Thanks,
Shan
I found one bug that prevented the coupled model from being restart reproducible: lake ice needs to be saved in the restart file. After fixing this by modifying GFS_surface_composites.F90 & GFS_surface_composites.meta, C96mx025 and C384mx025 can reproduce after restart. The updated code is at https://github.com/shansun6/ufs-s2s-model/tree/bugfix/denise_restart_lakeice, which is based on 64eeba7 of https://github.com/DeniseWorthen/ufs-s2s-model/tree/feature/restart/.
However, with this revised code, it still fails at C96mx100 and C192mx050 when restarting at 12h (43200s). Results at 44100s are identical, but differences appear on every tile at 45000s for both C96mx100 and C192mx050, suggesting it has something to do with the ocean coupling, since atm and ice use a 900s time step and the ocean's time step is 1800s. Attached is a plot of the difference in atmImp_Sa_z at 45000s on the 6 tiles. The fact that the differences occur along lines may indicate something in the atm and ocean interface layout. Also, coupling with the ocean at mx025 resolution is well tested, while both the mx100 and mx050 configurations are new. For more info, see the results of all 12 regression tests at /scratch2/BMC/gsd-fv3-dev/Shan.Sun/S2S_RT/rt_157734/.
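The timing argument above can be illustrated with a small arithmetic sketch. It only uses the numbers quoted in the comment (900s atm/ice step, 1800s ocean step, restart at 43200s) and assumes the ocean steps are counted from the restart time; the real run sequence is defined in nems.configure and may differ.

```python
ATM_ICE_DT = 900       # s, atm and ice time step (from the comment above)
OCN_DT = 1800          # s, ocean time step (from the comment above)
RESTART_TIME = 43200   # s, the 12h restart point

def first_steps_after_restart(n_steps):
    """List (time, ocean_step_completes) for the first n atm/ice steps after restart.

    Assumes ocean steps are aligned to the restart time (illustrative only).
    """
    rows = []
    for k in range(1, n_steps + 1):
        t = RESTART_TIME + k * ATM_ICE_DT
        ocean_step_completes = (t - RESTART_TIME) % OCN_DT == 0
        rows.append((t, ocean_step_completes))
    return rows

if __name__ == "__main__":
    for t, done in first_steps_after_restart(4):
        print(t, "ocean step completes" if done else "atm/ice only")
```

Under these assumptions, 44100s is an atm/ice-only step and 45000s is the first time after restart at which an ocean step has completed, which matches where the differences first show up.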
Thanks Shan. We've seen this pattern before in the ocean restart. Basically we're seeing the pattern of the ocean grid decomposition with differences arising from the halos. Let me look into it more since we have the right parameter options set for mx025. It is possible that the lower resolution ocean models need different or additional settings.
I've been able to get restart repro for all the test cases by cherry-picking an upcoming update to MOM6. It is the commit titled "Add halo updates needed with VERTEX_SHEAR=True". Substituting a MOM6 branch with that update added to our current emc/develop worked.
I did also try setting some of the MOM_input parameter settings for various bugs to their 'non-bug' setting since GFDL has them set for their own reproducibility testing but that had no impact.
Further testing has shown that the setting of USE_LA_LI2016=False is still required to obtain restart repro even when the above MOM6 branch is used.
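A minimal sketch of how a run directory's MOM_input could be checked for this setting follows. It assumes the usual `NAME = value ! comment` layout of MOM6 parameter files; the helper names and the sample text are hypothetical and not taken from the test branch.

```python
# Setting reported above as required for restart repro.
REQUIRED = {"USE_LA_LI2016": "False"}

def read_params(text):
    """Parse 'NAME = value ! comment' lines into a dict (assumed MOM_input layout)."""
    params = {}
    for raw in text.splitlines():
        line = raw.split("!", 1)[0].strip()   # drop any trailing comment
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        params[name.strip()] = value.strip()
    return params

def mismatches(params, required):
    """Return {name: found_value} for required settings that are missing or different."""
    return {name: params.get(name)
            for name, want in required.items()
            if params.get(name) != want}

# Hypothetical fragment, for illustration only.
SAMPLE_MOM_INPUT = """\
VERTEX_SHEAR = True      ! hypothetical comment
USE_LA_LI2016 = False    ! hypothetical comment
"""

if __name__ == "__main__":
    print(mismatches(read_params(SAMPLE_MOM_INPUT), REQUIRED) or "settings OK")
```

An empty result means the required setting is present with the expected value.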
@shansun6 Just as a heads-up, I will be moving the restart testing branch to my fork on ufs-weather.
The branch I am using for testing is now on my ufs-weather fork feature/cpld_restart.
This branch has FV3 updated to the current ufs-weather b955f81.
When using a MOM6 branch with the vertex halo fix and USE_LA_LI2016=False, I am now getting restart repro with all tested resolutions using frac_grid input in non-frac mode.
When testing the same branch in frac_grid mode (FRAC_GRID=.T. and CPLMODE=nems_frac) I find that FV3 does not reproduce on restart with the same small number of points (a dozen or so) for the same 8 fields on tile3 as previously noted.
When testing using PR #238 in non-frac mode, the model fails with saturation vapor pressure errors.
Using 64276ef6 from https://github.com/junwang-noaa/fv3atm/tree/ccppipd the model successfully runs. I only checked c96mx100 but it reproduces.
In frac grid mode, the atm doesn't reproduce on any tile.
I am using 24h/12h/12h->24h in the restart test.
I can set something like that up, but which exact case would you want me to test? The original c96mx025? That would test this FV3 branch in non-frac mode using the old (non-lake) c96 input.
OK, let me set that up.
I was testing global restart; it reproduces the control for 24/12/12->24, but not for 48/24/24->48. I'd like to check if s2s restart has the same issue. c96mx025 is OK; I am running in non-frac mode using the old oro data (no lake vars).
@junwang-noaa That must be why utest passes since it tests 24/12/12->24
This is the directory w/ the 2d/3d/1d mx025 restart test using FV3 @b955f810 :
/scratch1/NCEPDEV/stmp2/Denise.Worthen/FV3_RT/restart_2d3d
The mediator restart for the restart test at the end of the run is identical to the continuous 3d run restart.
I am currently testing the cpld_restart branch (41586b7). This has FV3 @ develop + the MOM6 halo fix for vertex shear (PR 1221).
Using frac_grid input in non-frac mode, I get restart repro at c96mx100, c192mx050, c384mx025 using the 1d/12h/12h test.
Testing the same configuration code in frac_grid mode, none of the cases give restart repro. The fields exported by FV3 at the first timestep after restarting are different than the continuous run.
Digging further, for the c96mx100 case, only tile3 and tile2 don't reproduce in the frac grid mode. Tile2 doesn't reproduce on two grid points. Both of these tile 2 points have land_frac=1.0. They are (i=59,j=80) and (i=46,j=93). The fields which don't reproduce are:
atmImp_Faxa_lwdn
atmImp_Faxa_lwnet
atmImp_Sa_pbot
atmImp_Sa_shum
atmImp_Sa_tbot
atmImp_Sa_z
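For reference, a short sketch comparing this list with the 8 fields from the original report (both lists copied from this thread; the variable names are hypothetical):

```python
# Fields from the original cprnc output in this issue (non-frac mode).
ORIGINAL_8 = {
    "atmImp_Faxa_rain", "atmImp_Faxa_snow", "atmImp_Faxa_swndf",
    "atmImp_Faxa_swvdf", "atmImp_Sa_pbot", "atmImp_Sa_shum",
    "atmImp_Sa_tbot", "atmImp_Sa_z",
}

# Fields listed above for the two tile2 points in frac grid mode.
TILE2_FRAC = {
    "atmImp_Faxa_lwdn", "atmImp_Faxa_lwnet", "atmImp_Sa_pbot",
    "atmImp_Sa_shum", "atmImp_Sa_tbot", "atmImp_Sa_z",
}

common = sorted(ORIGINAL_8 & TILE2_FRAC)         # in both lists
only_original = sorted(ORIGINAL_8 - TILE2_FRAC)  # only in the original report
only_tile2 = sorted(TILE2_FRAC - ORIGINAL_8)     # only in the tile2 list

if __name__ == "__main__":
    print("common:", common)
    print("only in original report:", only_original)
    print("only in tile2 list:", only_tile2)
```

The four state fields (pbot, shum, tbot, z) appear in both lists, while the flux fields differ: rain, snow and the two diffuse shortwave fluxes appear only in the original report, and the two longwave fluxes appear only in the tile2 list.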
The feature/cpld_restart branch in my fork is set up to output the cpl history files at every timestep. This is controlled by history_n=1 and history_option=nsteps.
The baselines pass because I had to comment out all the comparison files to get the ufs-weather branch to run the dep_run test using ecflow. I think the rt.sh for ufs-weather has an error control that if the control test fails, it does not run the dep_run. Since I'm not comparing against baselines for the restart test I don't care if the baseline fails. The only way I could get ufs-weather to run the dep_run test is to comment out the LIST_FILES.
The history files you're looking for are in the RESTART directory. The history files have the name ufs.cpld.cpl.hi. The restart files are the ufs.cpld.cpl.r.
Does 41586b7 point to the right FV3 & ccpp/physics submodules, or should I switch to the right submodules? I noticed ccpp/physics uses the Oct. 9 version and doesn't have the fix I put in on Oct. 17.
Thanks,
Shan
Hi Denise,
I am able to get restart reproducible for all tests in rt.cpldrestart.conf, after I switched FV3 submodule to https://github.com/shansun6/fv3atm/ -b fix_frac_rst_20201114 where I made 2 changes, including one from Moorthi. See if you get the same. Thanks, Moorthi!
Shan
Hi Shan,
I tried to run with frac_grid =T in the cpld_restart branch. I got compute_qs: saturation vapor pressure table overflow, nbad= 1
My run is here : /scratch1/NCEPDEV/stmp2/Denise.Worthen/FV3_RT/rt_36433/cpld_control_prod
I added the changes from your fv3atm branch. Are there other changes I need?
It is here: /scratch2/NCEPDEV/climate/Denise.Worthen/WORK/ufs_restart
It has an updated CICE and some changes I'm working on in CMEPS for normalizing the fluxes coming back from the atm. There is an extra field exported by ATM but nothing is being done w/ it.
Hi Shan,
I put in your revision to fv3 and I am now getting restart repro for c96,c192,c384 in frac_grid mode.