Climatology computation (with BGC) on Anvil has OOM error
I am running MPAS-Analysis for ocean BGC simulations (on the EC30to60E2r2 mesh) and it keeps giving me an error. The configuration is fairly standard, including climatology and global time series plots. From the log, the error occurs in `mpasClimatologyOceanAvg`, more specifically in `ncclimo` as far as I can tell. I am not sure why this keeps failing, since the same configuration worked before when I tested it on a different machine. Here is the output from MPAS-Analysis,
https://web.lcrc.anl.gov/public/e3sm/diagnostic_output/ac.ytakano/output_EC30to60E2r2_chrysalis-bgc1-piCTL_02182022-01/
(not complete because of an error).
I will also attach the log file here. Please let me know if further information is needed. Thank you.
mpasClimatologyOceanAvg.log
It's failing in the annual average. Is this the first time you have computed climos over 18 years (years 41-58)? I wonder if it's a memory problem. The log doesn't explicitly say so, I think, just that `ncra` failed.

Maybe you could try `parallelTaskCount = 12` and see what happens?
The text `Killed` in the log file indicates that the job ran out of memory. That is a common reason things work differently on different machines: Anvil nodes have much less memory than Chrysalis nodes.
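A quick way to confirm an out-of-memory kill is to search the task log for that message; a minimal sketch, using the log file attached above:

```
# search the MPAS-Analysis task log for the OOM killer's "Killed" message
grep -n "Killed" mpasClimatologyOceanAvg.log
```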
Unfortunately, @milenaveneziani's suggestion of running with `parallelTaskCount = 12` won't help here because a different config option controls this. Please try adding this section (if missing) and option to your config file:
```
[execute]
## options related to executing parallel tasks

...

# the number of total threads to use when ncclimo runs in "bck" or "mpi" mode.
# Reduce this number if ncclimo is crashing (maybe because it is out of memory).
# The number of threads must be a factor of 12 (1, 2, 3, 4, 6 or 12).
ncclimoThreads = 6
```
Please run with the `--purge` flag to get a fresh start (not much analysis happened anyway). If you still see `Killed` in the log file, please try 4, then 3, then 2 threads, though I don't believe that will be necessary.
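For reference, a minimal sketch of the rerun; `bgc_piCTL.cfg` is a hypothetical stand-in for your actual config file:

```
# remove the partially completed analysis and start fresh
mpas_analysis --purge bgc_piCTL.cfg
```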
@ytakano3, I can see your config file from the incomplete web page. Here's what you need:
```
[execute]
## options related to executing parallel tasks

# the number of parallel tasks (1 means tasks run in serial, the default)
parallelTaskCount = 12

# the parallelism mode in ncclimo ("serial" or "bck")
# Set this to "bck" (background parallelism) if running on a machine that can
# handle 12 simultaneous processes, one for each monthly climatology.
ncclimoParallelMode = bck

# the number of total threads to use when ncclimo runs in "bck" or "mpi" mode.
# Reduce this number if ncclimo is crashing (maybe because it is out of memory).
# The number of threads must be a factor of 12 (1, 2, 3, 4, 6 or 12).
ncclimoThreads = 6

# "None" if ESMF should perform mapping file generation in serial without a
# command, or one of "srun" or "mpirun" if it should be run in parallel (or in
# serial but with a command)
mapParallelExec = srun

# "None" if ncremap should perform remapping without a command, or "srun"
# possibly with some flags if it should be run with that command
ncremapParallelExec = srun
```
I would continue to use 12 parallel tasks unless other operations besides `ncclimo` run out of memory.
Unrelated to this error, but please consider using `-m anvil` when you call MPAS-Analysis so you don't need some of these config options. See: https://github.com/MPAS-Dev/MPAS-Analysis/blob/develop/example_e3sm.cfg
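For example (a sketch; `bgc_piCTL.cfg` is again a hypothetical stand-in for your config file):

```
# -m anvil loads the Anvil machine defaults so you don't have to set them yourself
mpas_analysis -m anvil bgc_piCTL.cfg
```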
With this approach, you would not need the other config options in the `[execute]` section besides `ncclimoThreads` and `parallelTaskCount`, because the rest should be the machine defaults. Likewise, the `[diagnostics]` section would not be needed.
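In other words, the `[execute]` section could shrink to something like this sketch, keeping only the overrides suggested above:

```
[execute]
# only the settings that differ from the Anvil machine defaults
parallelTaskCount = 12
ncclimoThreads = 6
```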
With this approach, you can also use these config "variables":

```
# put files in the Anvil web portal
htmlSubdirectory = ${web_portal:base_path}/${web_portal:username}/output_EC30to60E2r2_chrysalis-bgc1-piCTL_02182022-01
```
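These placeholders use configparser-style extended interpolation: `${web_portal:base_path}` pulls the `base_path` option from a `[web_portal]` section supplied by the machine config. For illustration only, the shape of that section would be something like the following (the values here are placeholders, not the real Anvil settings):

```
[web_portal]
# provided by the machine config, not by your run config
base_path = /path/to/web/portal/root
username = your_portal_username
```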
With this approach, your config file will be more portable between machines.
> Unrelated to this error, but please consider using `-m anvil`

ah, this is the newer capability that still hasn't stuck in my brain! thanks for the reminder, @xylar.
Thank you @xylar and @milenaveneziani for the suggestions and information. Great tips, and I should have been aware of the memory differences between machines. I will try again after adding the settings to `[execute]` (and yes, with `--purge`).