large EC-Earth data fails at CMIP6 cleaning due to OOM error
emileten opened this issue · 3 comments
Blocks progress on #263 and #266.
Workflow : https://argo.cildc6.org/archived-workflows/default/28a83ec8-998f-4d61-96f1-73f57387d3e7
Look at the `standardize_gcm` step in cleaning: every retry failed due to OOM errors.
I took the input of one of these failed pods and reproduced the OOM on JupyterHub with a 48GB server, which matches the resource limit specified for this pod in our argo workflow.
In `standardize_gcm` we load the data into memory. EC-Earth3 `pr` is 256 * 512 * time, so higher resolution than other models, but that's still only ~16GB for the future data. The problem, I think, is that some operations in `standardize_gcm` blow up the memory usage well beyond that.
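For a rough sense of scale, here is a back-of-the-envelope estimate. The grid dimensions come from the numbers above; the float32 dtype and the count of daily time steps are my assumptions:

```python
# Back-of-the-envelope memory estimate for an EC-Earth3 daily pr field.
# Assumes float32 and ~85 years of daily future data; both are guesses.
nlat, nlon, ntime = 256, 512, 31_411
bytes_per_value = 4  # float32

raw_gb = nlat * nlon * ntime * bytes_per_value / 1e9
print(f"raw array: ~{raw_gb:.1f} GB")  # ~16.5 GB

# Once the data is eagerly loaded, every full-array operation in
# standardize_gcm can materialize another copy of this size, so peak usage
# can exceed the 48 GB pod limit even though the raw data fits comfortably.
```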
Another model of this family, with a lower resolution, running here https://argo.cildc6.org/workflows/default/e2e-ec-earth3-veg-lr-pr-t8stn?tab=workflow&nodeId=e2e-ec-earth3-veg-lr-pr-t8stn-1904431442, nearly crashed at the same step for the same reason, but survived thanks to retries.
An additional detail: in `standardize_cmip6`, the two culprits are:
- the precip unit conversion: `ds_cleaned['pr'] * 24 * 60 * 60`
- the leap-day removal: `xclim_remove_leapdays(ds_cleaned)`
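A toy illustration of why these two hurt once the dataset has been eagerly loaded. This is not dodola code, just my reading of the situation with a tiny stand-in array:

```python
import numpy as np
import xarray as xr

# Stand-in for an eagerly loaded, in-memory pr field (real one is 256 x 512 x ~31k).
pr = xr.DataArray(
    np.zeros((365, 8, 16), dtype="float32"),
    dims=("time", "lat", "lon"),
    name="pr",
)
ds_cleaned = pr.to_dataset()

# Unit conversion: allocates a brand-new full-size array on top of the original.
pr_mm_per_day = ds_cleaned["pr"] * 24 * 60 * 60

# Leap-day removal likewise re-indexes/copies the data along time, so with the
# real field these intermediates can stack up past the 48 GB pod limit.
```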
If we're willing to spend time on this, I see only one acceptable option: split the `standardize_cmip6` step so that argo works on a few spatial chunks (see the sketch below). That way we'd also avoid changing anything in dodola. `standardize_cmip6` is spatially independent, so that would be fine.
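A sketch of what the split could look like, assuming the data has a `lat` dimension and the cleaned bands can simply be concatenated back together. The helper names are hypothetical, not existing dodola or argo code:

```python
import xarray as xr

def iter_lat_bands(ds: xr.Dataset, n_bands: int = 4):
    """Yield contiguous latitude bands, each small enough for one argo pod."""
    nlat = ds.sizes["lat"]
    step = -(-nlat // n_bands)  # ceiling division
    for start in range(0, nlat, step):
        yield ds.isel(lat=slice(start, start + step))

# Each band would be cleaned by the existing dodola step in its own pod,
# then the outputs stitched back together along "lat", e.g.:
# cleaned = xr.concat(
#     [run_standardize_cmip6(band) for band in iter_lat_bands(ds)], dim="lat"
# )
```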
Two other options that won't work: increasing the resource limits, or restructuring `standardize_cmip6` in dodola. The former won't actually help because the problem is too severe; the latter has the same issue and on top of that implies a lot of rewriting.
Note that fixing this issue would let in the data of 4 models from this consortium.
[Edit: updated some information and clarified]
Oh ok. I think I understand better what happened here.
- I was puzzled half an hour ago by the fact that some EC precip models had already passed the cleaning stage in the past. In fact, I moved the models' card backward on the project board -- @brews it seems you had run the precip cleaning steps for these models just fine, for example in this workflow: https://argo.cildc6.org/archived-workflows/default/769b4e04-61e5-4efc-a409-ba480909a292. Why is it complaining about memory now? Answer below.
- It was using dodola 0.8.0, and dodola 0.8.0 was not loading the data into memory in `dodola.services.standardize_cmip6` before running `dodola.core.standardize_gcm`. We introduced that eager load as a necessary patch to be able to convert from 360-day calendars (which requires data that is not chunked across time) in this PR: ClimateImpactLab/dodola#151. We didn't realize it at the time, but this PR broke the EC precip cleaning step.
The only step requiring the absence of temporal chunks is the 360-day calendar conversion, though. Therefore, I suggest we move the data loading to that specific location in the code (rough sketch below). It's a very easy change and it restores the backward compatibility broken by that PR. The only downside is that it introduces chunking concerns into `dodola.core`. We already have some there, though...
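A rough sketch of the change I have in mind, assuming the calendar handling in `dodola.core` uses xclim's calendar utilities. The function name and control flow below are illustrative, not the actual dodola code:

```python
import xarray as xr
from xclim.core.calendar import convert_calendar, get_calendar

def maybe_fix_360day_calendar(ds: xr.Dataset) -> xr.Dataset:
    """Only load into memory when the 360-day conversion actually needs it."""
    if get_calendar(ds) == "360_day":
        # The conversion needs the time axis unchunked, so load here rather
        # than eagerly in dodola.services.standardize_cmip6.
        ds = ds.load()
        ds = convert_calendar(ds, target="standard", align_on="year")
    return ds
```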
As I expected, two additional EC-Earth models failed due to this (EC-Earth3-AerChem and EC-Earth3-CC).