ClimateImpactLab/downscaleCMIP6

large EC-Earth data fails at CMIP6 cleaning due to OOM error

emileten opened this issue · 3 comments

Blocks progress on #263 and #266.

Workflow: https://argo.cildc6.org/archived-workflows/default/28a83ec8-998f-4d61-96f1-73f57387d3e7

Looking at the standardize_gcm step in cleaning: each and every retry failed due to OOM errors.

I took the input of one of these failed pods and reproduced the OOM on JupyterHub with a 48 GB server, which is the resource limit specified for this pod in our Argo workflow.

In standardize_gcm we load the data into memory. EC-Earth3 pr is 256 * 512 * time, so higher resolution than the other models, but that is still only ~16 GB for the future data. The problem, I think, is that standardize_gcm contains operations that make memory usage blow up well beyond that.
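For context, a back-of-the-envelope estimate of the raw array size (assuming daily float32 data and roughly 85 years of future simulation; the exact period length is my assumption):

```python
# Rough size of the EC-Earth3 daily pr array, 256 lat x 512 lon, float32.
# The ~85-year future period is an assumption used only for illustration.
nlat, nlon = 256, 512
ntime = 85 * 365              # ~85 years of daily, no-leap time steps
bytes_per_value = 4           # float32
size_gb = nlat * nlon * ntime * bytes_per_value / 1024**3
print(f"~{size_gb:.1f} GB")   # ~15.1 GB, consistent with the ~16 GB above
```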

Another model of this family, with a lower resolution, running here https://argo.cildc6.org/workflows/default/e2e-ec-earth3-veg-lr-pr-t8stn?tab=workflow&nodeId=e2e-ec-earth3-veg-lr-pr-t8stn-1904431442, nearly crashed at the same steps for the same reason, but survived thanks to retries.

An additional detail: in standardize_cmip6, the two culprits (illustrated in the sketch after this list) are:

  1. The precip unit conversion: ds_cleaned['pr'] * 24 * 60 * 60
  2. xclim_remove_leapdays(ds_cleaned)
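To make concrete why these two operations push an already loaded dataset over the limit, here is a toy sketch with shrunken dimensions (not dodola's actual code): each operation returns a new full-size array while the original is still referenced, so the peak footprint is at least twice the ~16 GB input.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Toy stand-in for the cleaned, eagerly loaded dataset. The real pr array is
# ~16 GB; dimensions are shrunk here so the example runs anywhere.
time = pd.date_range("2015-01-01", periods=3 * 365 + 1, freq="D")
ds_cleaned = xr.Dataset(
    {"pr": (("time", "lat", "lon"),
            np.random.rand(time.size, 32, 64).astype("float32"))},
    coords={"time": time},
)

# 1. Unit conversion (kg m-2 s-1 -> mm/day): allocates a second full-size
#    array, so peak memory is roughly 2x the input while both are alive.
pr_mm_day = ds_cleaned["pr"] * 24 * 60 * 60

# 2. Dropping leap days: boolean selection copies nearly the whole array again.
ds_noleap = ds_cleaned.sel(
    time=~((ds_cleaned.time.dt.month == 2) & (ds_cleaned.time.dt.day == 29))
)
```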

If we're willing to spend time on this, I see only one acceptable option: split the standardize_cmip6 step so that Argo works on a few spatial chunks. That would also avoid changing anything in dodola. standardize_cmip6 is spatially independent, so this would be fine.
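Roughly what I have in mind, as a hedched sketch only (the band count and helper name are hypothetical, not existing dodola or workflow code):

```python
import xarray as xr

def latitude_bands(ds: xr.Dataset, n_bands: int = 4):
    """Yield the dataset split into n_bands contiguous latitude slices."""
    nlat = ds.sizes["lat"]
    step = -(-nlat // n_bands)  # ceiling division
    for start in range(0, nlat, step):
        yield ds.isel(lat=slice(start, start + step))

# Each band would be cleaned by its own Argo task and the standardized
# outputs concatenated back along "lat" afterwards; since standardize_cmip6
# is spatially independent, the result matches cleaning the full grid.
```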

Two other options that won't work: increasing the resource limits, or restructuring standardize_cmip6 in dodola. The former won't actually help because the problem is too severe; the same goes for the latter, which on top of that would imply a lot of rewriting.

Note that fixing this issue would let us bring in data from 4 models of this consortium.

[Edit: updated some information and clarified]

Oh ok. I think I understand better what happened here.

  • I was puzzled half an hour ago by the fact that some precip EC models had already passed the cleaning stage in the past. In fact, I moved the models card backward on the project board -- @brews it seems you had run the precip cleaning steps for these models just fine, for example in this workflow: https://argo.cildc6.org/archived-workflows/default/769b4e04-61e5-4efc-a409-ba480909a292. Why is it complaining about memory now? Answer below.
  • That workflow was using dodola 0.8.0, which did not load the data into memory in dodola.services.standardize_cmip6 before running dodola.core.standardize_gcm. We introduced the in-memory load as a necessary patch to be able to convert from 360-day calendars (which requires data that is not chunked across time), in this PR: ClimateImpactLab/dodola#151
  • We didn't realize it, but that PR broke the EC precip cleaning step.

However, the only step requiring the absence of temporal chunks is the 360-day calendar conversion. I therefore suggest we move the data loading to that specific spot in the code. It's a very easy change and it restores the backward compatibility that the PR broke. The only downside is that it introduces chunking concerns into dodola.core. We already have some there, though...
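A minimal sketch of what I mean, assuming the conversion lives in something like dodola.core.standardize_gcm (the calendar check and the convert_360day_calendar helper below are illustrative, not dodola's actual API):

```python
import xarray as xr

def standardize_gcm(ds: xr.Dataset) -> xr.Dataset:
    # ... earlier cleaning steps stay lazy / dask-backed ...

    if ds.time.dt.calendar == "360_day":
        # Only the 360-day calendar conversion needs the data unchunked in
        # time, so rechunk (or load) here instead of eagerly loading the
        # whole dataset up in dodola.services.standardize_cmip6.
        ds = ds.chunk({"time": -1})
        ds = convert_360day_calendar(ds)  # hypothetical conversion helper

    # ... remaining steps ...
    return ds
```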

As I expected, two additional EC-Earth models failed due to this (EC-Earth3-AerChem and EC-Earth3-CC).