MPAS-Dev/MPAS-Analysis

Climatology computation (with BGC) on Anvil has OOM error


I am running MPAS-Analysis for ocean BGC simulations (on the EC30to60E2r2 mesh) and it keeps giving me an error.
The configuration is fairly standard, including climatologies and global time series plots.

From the log, the ERROR occurs in "mpasClimatologyOceanAvg", more specifically in "ncclimo" as far as I understand. I am not sure why this keeps failing, since the same configuration worked on a different machine when I tested it before. Here is the output from MPAS-Analysis:
https://web.lcrc.anl.gov/public/e3sm/diagnostic_output/ac.ytakano/output_EC30to60E2r2_chrysalis-bgc1-piCTL_02182022-01/
(not complete because of an error).

I will also attach the log file here. Please let me know if further information is needed. Thank you.
mpasClimatologyOceanAvg.log

It's failing in the annual average. Is this the first time that you compute climos over 18 years (years 41-58)? I wonder if it's a memory problem. It doesn't explicitly say so, I think, just that ncra failed.

maybe you could try with parallelTaskCount = 12 and see what happens?

xylar commented

The text Killed in the log file indicates that the job ran out of memory. That is a common reason things work differently on different machines. Anvil nodes have much less memory than Chrysalis nodes.
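As a rough way to check a log for this symptom, the sketch below searches text for the word Killed and reports the line numbers. The helper name `killed_lines` and the sample text are made up for illustration; the sample is placeholder text, not real ncclimo output.

```python
def killed_lines(log_text):
    """Return (line_number, line) pairs for every line containing 'Killed'."""
    return [
        (number, line)
        for number, line in enumerate(log_text.splitlines(), start=1)
        if "Killed" in line
    ]


# Placeholder text only; a real log would be read from the task's log file,
# e.g. pathlib.Path("mpasClimatologyOceanAvg.log").read_text()
placeholder = "first placeholder line\nKilled\nlast placeholder line"
print(killed_lines(placeholder))
```

An empty result only means the word does not appear; it does not rule out other causes of failure.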

Unfortunately, @milenaveneziani's suggestion of running with parallelTaskCount = 12 won't help here because the number of ncclimo threads is controlled by a different config option. Please try adding this section (if missing) and option to your config file:

[execute]
## options related to executing parallel tasks

...

# the number of total threads to use when ncclimo runs in "bck" or "mpi" mode.
# Reduce this number if ncclimo is crashing (maybe because it is out of memory).
# The number of threads must be a factor of 12 (1, 2, 3, 4, 6 or 12).
ncclimoThreads = 6

Please run with the --purge flag to get a fresh start (not much analysis happened anyway). If you still see Killed in the log file, please try 4, then 3, then 2 threads. I don't believe that will be necessary.
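To make the "factor of 12" rule and the fallback order concrete, here is a small sketch that lists the allowed thread counts and the order in which to try them when starting from 6. The function names are hypothetical helpers, not part of MPAS-Analysis.

```python
def valid_thread_counts(months=12):
    """Thread counts that divide the 12 monthly climatologies evenly."""
    return [n for n in range(1, months + 1) if months % n == 0]


def fallback_sequence(start, months=12):
    """Allowed thread counts to try, largest first, beginning at `start`."""
    return [n for n in reversed(valid_thread_counts(months)) if n <= start]


print(valid_thread_counts())  # [1, 2, 3, 4, 6, 12]
print(fallback_sequence(6))   # [6, 4, 3, 2, 1]
```

Each step down should lower peak memory use, presumably at the cost of a longer ncclimo run time.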

xylar commented

@ytakano3, I can see your config file from the incomplete web page. Here's what you need:

[execute]
## options related to executing parallel tasks

# the number of parallel tasks (1 means tasks run in serial, the default)
parallelTaskCount = 12

# the parallelism mode in ncclimo ("serial" or "bck")
# Set this to "bck" (background parallelism) if running on a machine that can
# handle 12 simultaneous processes, one for each monthly climatology.
ncclimoParallelMode = bck

# the number of total threads to use when ncclimo runs in "bck" or "mpi" mode.
# Reduce this number if ncclimo is crashing (maybe because it is out of memory).
# The number of threads must be a factor of 12 (1, 2, 3, 4, 6 or 12).
ncclimoThreads = 6

# "None" if ESMF should perform mapping file generation in serial without a
# command, or one of "srun" or "mpirun" if it should be run in parallel (or in
# serial but with a command)
mapParallelExec = srun

# "None" if ncremap should perform remapping without a command, or "srun"
# possibly with some flags if it should be run with that command
ncremapParallelExec = srun

I would continue to use 12 parallel tasks unless other operations besides ncclimo run out of memory.
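A quick sanity check on these settings can be sketched with Python's standard configparser. This is only an illustration under the assumption that the file parses as a plain INI file; `check_execute_section` is a hypothetical helper, and only the option names come from the config above.

```python
from configparser import ConfigParser

# A trimmed copy of the [execute] section shown above
CONFIG_TEXT = """
[execute]
parallelTaskCount = 12
ncclimoParallelMode = bck
ncclimoThreads = 6
"""


def check_execute_section(text):
    """Return a list of problems found with ncclimoThreads (empty if none)."""
    cfg = ConfigParser()
    cfg.read_string(text)
    threads = cfg.getint("execute", "ncclimoThreads")
    problems = []
    if threads < 1:
        problems.append("ncclimoThreads must be at least 1")
    elif 12 % threads != 0:
        problems.append(
            f"ncclimoThreads = {threads} is not a factor of 12 "
            "(allowed: 1, 2, 3, 4, 6 or 12)"
        )
    return problems


print(check_execute_section(CONFIG_TEXT))  # []
```

Note that this checks only the divisibility rule from the config comments; it cannot tell whether a given thread count fits in a node's memory.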

xylar commented

Unrelated to this error, but please consider using -m anvil when you call MPAS-Analysis so you don't need some of these config options. See: https://github.com/MPAS-Dev/MPAS-Analysis/blob/develop/example_e3sm.cfg

With this approach, you would not need the other config options in the [execute] section besides ncclimoThreads and parallelTaskCount because they should be the machine defaults. Likewise, the [diagnostics] section would not be needed.

With this approach, you can also use "variables" like these in your config file:

# put files in the Anvil web portal*
htmlSubdirectory = ${web_portal:base_path}/${web_portal:username}/output_EC30to60E2r2_chrysalis-bgc1-piCTL_02182022-01

With this approach, your config file will be more portable between machines.
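The ${section:option} form matches the syntax of extended interpolation in Python's configparser. Assuming the variables are resolved that way, the sketch below shows how such a value expands. The path, username and run name are made-up placeholders, and placing htmlSubdirectory in an [output] section is an assumption of this example.

```python
from configparser import ConfigParser, ExtendedInterpolation

# Placeholder values; with -m anvil the real [web_portal] values would
# come from the machine defaults rather than from the user's file.
EXAMPLE = """
[web_portal]
base_path = /hypothetical/base/path
username = hypothetical_user

[output]
htmlSubdirectory = ${web_portal:base_path}/${web_portal:username}/my_run
"""

cfg = ConfigParser(interpolation=ExtendedInterpolation())
cfg.read_string(EXAMPLE)

# The ${...} references are expanded when the option is read
print(cfg.get("output", "htmlSubdirectory"))
# /hypothetical/base/path/hypothetical_user/my_run
```

Because the references are resolved at read time, the same line works on another machine as long as that machine supplies its own [web_portal] values.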

Unrelated to this error, but please consider using -m anvil

ah, this is the newer capability that still hasn't stuck in my brain! thanks for the reminder @xylar.

Thank you @xylar and @milenaveneziani for the suggestions and for sharing this information. Great tips, and I should have been aware that available memory differs between machines. I will try again after adding the settings to [execute] (and yes, with --purge).

xylar commented

@ytakano3 was able to confirm that fewer ncclimo tasks on Anvil took care of this. I will reduce the default number of tasks on Anvil before closing this.

@xylar Thank you for following up, and my apologies for the delay in updating. Yes, using fewer ncclimo tasks on Anvil took care of my issue, and thank you for changing the issue title. I am not sure whether this would still crash without BGC, but it certainly does with BGC.