Writing native grid atmf history files is too slow in FV3ATM
Closed this issue · 12 comments
Description
The G-W gdas fcst job slows down significantly when the option to write native grid history files is turned on. Besides the resource issue on the write grid component, it was also found that the model writes native grid atmf history files significantly more slowly than it writes the Gaussian grid atmf history files or the native grid restart files. The timing from Dave's test is shown below:
nid002370.dogwood.wcoss2.ncep.noaa.gov 2544: ./atmf003.nc write time is 18.91891 at fcst 03:00
nid002370.dogwood.wcoss2.ncep.noaa.gov 2544: ./cubed_sphere_grid_atmf003.nc write time is 184.79446 at fcst 03:00
nid002370.dogwood.wcoss2.ncep.noaa.gov 2544: ./cubed_sphere_grid_sfcf003.nc write time is 36.00565 at fcst 03:00
nid002370.dogwood.wcoss2.ncep.noaa.gov 2544: ./sfcf003.nc write time is 36.36828 at fcst 03:00
nid002370.dogwood.wcoss2.ncep.noaa.gov 2544: RESTART/20211220.210000.fv_core.res.nc write time is 5.30265 at fcst 03:00
nid002370.dogwood.wcoss2.ncep.noaa.gov 2544: RESTART/20211220.210000.fv_srf_wnd.res.nc write time is 0.01886 at fcst 03:00
nid002370.dogwood.wcoss2.ncep.noaa.gov 2544: RESTART/20211220.210000.fv_tracer.res.nc write time is 7.70513 at fcst 03:00
nid002370.dogwood.wcoss2.ncep.noaa.gov 2544: RESTART/20211220.210000.phy_data.nc write time is 7.28120 at fcst 03:00
nid002370.dogwood.wcoss2.ncep.noaa.gov 2544: RESTART/20211220.210000.sfc_data.nc write time is 3.23882 at fcst 03:00
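For scale, the log above implies the native-grid atmf write is about a 9.8x slowdown relative to the Gaussian-grid write. A quick check of the reported numbers (purely illustrative):

```python
# Write times (seconds) copied from the log lines above.
gaussian_atmf = 18.91891   # atmf003.nc
native_atmf = 184.79446    # cubed_sphere_grid_atmf003.nc

slowdown = native_atmf / gaussian_atmf
print(f"native atmf write is ~{slowdown:.1f}x slower than Gaussian")
```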
@DavidHuber-NOAA provided a HR4 gdasfcst test case on dogwood at:
/lfs/h2/emc/global/noscrub/David.Huber/keep/gdasfcst_w_native_rundir
I noticed that in the above run directory, in model_configure, the lossy compression (quantization) parameters are set as:
$ grep quantize /lfs/h2/emc/global/noscrub/David.Huber/keep/gdasfcst_w_native_rundir/model_configure
quantize_mode: 'quantize_bitround'
quantize_nsd: 5
The quantize_nsd parameter for 'quantize_bitround' mode specifies the number of significant bits (5 in this case). 5 bits is very low and probably not enough for fields like temperature. This has nothing to do with the native grid file write time, but I just wanted to check whether this is really intended.
@aerorahul the quantize_nsd and quantize_bitround fields were updated in NOAA-EMC/global-workflow@386ce38. Just checking if 5 digits is enough for our needs.
@junwang-noaa and @aerorahul had a conversation on what these should be.
If we need more fine-grain control based on resolution/run, we can. Just let us know what those values should be.
@DusanJovic-NOAA the quantize_nsd and quantize_bitround configurations correspond to the previous nbits=14 setting in our customized lossy compression code. The physics group evaluated results for nbits settings from 12 to 32 and decided on nbits=14 for GFSv16. The quantize_nsd=5 setting corresponds to nbits=14.
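As a rough sanity check on these settings (an illustrative sketch, not code from the model; the exact error behavior depends on the NetCDF quantize implementation), rounding a float to nsb significant mantissa bits, as 'quantize_bitround' does, bounds the worst-case relative error at about 2^-(nsb+1):

```python
def bitround_max_rel_err(nsb: int) -> float:
    """Approximate worst-case relative error after rounding a float
    to nsb explicit mantissa bits (half a ULP at the kept precision)."""
    return 2.0 ** -(nsb + 1)

print(f"nsb=5  -> ~{bitround_max_rel_err(5):.3%} worst-case relative error")
print(f"nsb=14 -> ~{bitround_max_rel_err(14):.5%} worst-case relative error")
```

By this estimate nsb=5 allows roughly 1.6% relative error, which is why the earlier comment questioned whether it is sufficient for fields like temperature.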
@DavidHuber-NOAA Can you sync the input data for the test case you provided on Cactus? I see these errors:
FATAL ERROR: in opening file
/lfs/h2/emc/global/noscrub/David.Huber/GW/develop/fix/am/global_slmask.t1534.3072.1536.grb
FATAL ERROR: in opening file
/lfs/h2/emc/global/noscrub/David.Huber/GW/develop/fix/am/global_slmask.t1534.3072.1536.grb
@DusanJovic-NOAA This test case was run on Dogwood and I do not have access to it now that it is in production. However, I just created a fresh clone into develop. Let me know if that works for you. If not, I will rerun the case on Cactus.
Thanks. It works, but I had to change the directory names in input.nml.
ls: cannot access '/lfs/h2/emc/global/noscrub/David.Huber/GW/develop/fix/am/global_slmask.t1534.3072.1536.grb': No such file or directory
but the one with `david.huber` does exist.
I found that the native history file write is noticeably faster if I change the size of the chunks, specifically:
diff --git a/io/module_write_netcdf.F90 b/io/module_write_netcdf.F90
index b016415..03a9d57 100644
--- a/io/module_write_netcdf.F90
+++ b/io/module_write_netcdf.F90
@@ -398,14 +398,14 @@ contains
par_access = NF90_COLLECTIVE
if (rank == 2 .and. ichunk2d(grid_id) > 0 .and. jchunk2d(grid_id) > 0) then
if (is_cubed_sphere) then
- chunksizes = [im, jm, tileCount, 1]
+ chunksizes = [im, jm, 1, 1]
else
chunksizes = [ichunk2d(grid_id), jchunk2d(grid_id), 1]
end if
ncerr = nf90_def_var_chunking(ncid, varids(i), NF90_CHUNKED, chunksizes) ; NC_ERR_STOP(ncerr)
else if (rank == 3 .and. ichunk3d(grid_id) > 0 .and. jchunk3d(grid_id) > 0 .and. kchunk3d(grid_id) > 0) then
if (is_cubed_sphere) then
- chunksizes = [im, jm, lm, tileCount, 1]
+ chunksizes = [im, jm, 1, 1, 1]
else
chunksizes = [ichunk3d(grid_id), jchunk3d(grid_id), min(kchunk3d(grid_id),fldlev(i)), 1]
end if
Can you apply this change in the code, recompile, and rerun your test?
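For intuition on why this chunking change helps so much, here is a back-of-the-envelope calculation (illustrative only; the C768 dimensions im=jm=768, lm=127, tileCount=6 and 4-byte reals are assumptions, not values read from the run directory). The original chunk shape spans every level and every tile, so each chunk is enormous; the new per-level, per-tile chunk is a few MiB:

```python
# Assumed C768 grid dimensions (not taken from the run directory).
im, jm, lm, tiles, bytes_per_val = 768, 768, 127, 6, 4

old_3d_chunk = im * jm * lm * tiles * bytes_per_val   # [im, jm, lm, tileCount, 1]
new_3d_chunk = im * jm * bytes_per_val                # [im, jm, 1, 1, 1]

print(f"old 3D chunk: {old_3d_chunk / 2**20:9.2f} MiB")   # roughly 1.7 GiB
print(f"new 3D chunk: {new_3d_chunk / 2**20:9.2f} MiB")   # roughly 2.25 MiB
```

One plausible reading is that multi-GiB chunks overwhelm the HDF5 chunk cache and force far more data movement per write than the smaller per-tile, per-level chunks; in any case, the timing results below confirm the speedup empirically.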
@DusanJovic-NOAA Thanks for the quick attention on this. I gave your code changes a try and ran a fresh forecast with native grid writes enabled at C768. This significantly reduced the runtime, from ~60 minutes to ~23 minutes. I copied the run directory into /lfs/h2/emc/global/noscrub/david.huber/keep/gdasfcst_fast_native and the log file can be found here: /lfs/h2/emc/global/noscrub/david.huber/para/COMROOT/fix_slow_writes/logs/2021122018/gdasfcst_seg0.log.
@DavidHuber-NOAA Thank you for checking. @junwang-noaa should we update the code in develop with these changes?
@DusanJovic-NOAA Thanks for debugging the issue. The timing looks good now. Please update the develop branch.