google-deepmind/graphcast

Memory exhaustion running gencast mini demo on the recommended runtime V2-8-TPU

Closed this issue · 6 comments

Hi,

I tried running gencast_mini_demo.ipynb on the V2-8-TPU runtime but got a memory exhaustion error at the cell titled "Autoregressive rollout (loop in python)". Any idea how to get the mini demo running successfully?

See error:

XlaRuntimeError: RESOURCE_EXHAUSTED: XLA:TPU compile permanent error. Ran out of memory in memory space hbm.

I hit the same error on the recommended V2-8-TPU runtime.

In https://github.com/google-deepmind/graphcast/blob/main/docs/cloud_vm_setup.md#gencast-memory-requirements memory requirements are described as:

1deg GenCast: ~21GB of Host RAM (System Memory) and ~8GB of HBM (vRAM)

Before hitting the error above, I see host RAM utilization on the VM of around 240 GB, over an order of magnitude more than documented, when running gencast_mini_demo.ipynb without any modifications on a V2 TPU through Google Colab. Is that expected?

Hello!

Could you please confirm which model you're running in the demo notebook?

In particular, you may notice the "Choose the model" cell allows one to load the GenCast Mini checkpoint or a random model. I wonder if you are unintentionally running a large random model and going OOM?

We've checked again on our end and GenCast Mini should run fine (as should the default random selection, since it uses the same specifications as GenCast Mini: 2^4 mesh and 512 latents).

Let me know,

Andrew

Hi Andrew,
I got it to run successfully using the checkpoint, but if I choose the random model selection it runs out of memory.
See the parameters in the attached screenshot.
[Attached screenshot, 2024-12-11 at 10:05:56 AM]

Glad to hear the checkpoint ran fine!

Could you share which dataset you are loading and running the random model on? I suspect you may be running on a 0.25deg dataset, which would explain the OOM, as this requires more memory than 1deg.

Apologies for the constraints; the free compute is quite limited! I will send a patch soon to clarify these in the demo notebook.
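
For a rough sense of why 0.25deg data is so much heavier than 1deg, here is a minimal sketch that counts grid points at each resolution. It assumes an equiangular latitude-longitude grid that includes both poles; `grid_points` is a hypothetical helper, not part of the graphcast package, and actual memory use will not scale exactly with the point count.

```python
def grid_points(resolution_deg: float) -> int:
    """Count points on an equiangular lat-lon grid (assumed to include both poles)."""
    n_lat = round(180 / resolution_deg) + 1  # latitudes from -90 to 90 inclusive
    n_lon = round(360 / resolution_deg)      # longitudes from 0 up to, not including, 360
    return n_lat * n_lon


points_1deg = grid_points(1.0)
points_025deg = grid_points(0.25)
ratio = points_025deg / points_1deg

print(f"1deg grid points:    {points_1deg}")
print(f"0.25deg grid points: {points_025deg}")
print(f"ratio:               {ratio:.1f}x")
```

Under these assumptions the 0.25deg grid has roughly 16 times as many points per variable and level, which is consistent with a model that fits at 1deg running out of HBM at 0.25deg.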

Hi Andrew,

You are correct: the dataset being selected was the 0.25 degree one instead of the 1 degree one. See attached.

The model ran successfully after choosing the 1 degree (step 1) dataset. See attached.
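
A mismatch like this can be caught before the rollout starts by checking the grid spacing of the loaded data. The following is only a sketch under stated assumptions: `infer_resolution_deg` and `check_resolution` are hypothetical helpers (not functions from the notebook or the graphcast package), and the latitude values are assumed to be sorted and evenly spaced.

```python
def infer_resolution_deg(latitudes):
    """Infer grid spacing in degrees from sorted, evenly spaced latitude values."""
    if len(latitudes) < 2:
        raise ValueError("need at least two latitude values")
    return abs(latitudes[1] - latitudes[0])


def check_resolution(latitudes, expected_deg, tol=1e-6):
    """Raise ValueError if the data resolution differs from the expected one."""
    actual = infer_resolution_deg(latitudes)
    if abs(actual - expected_deg) > tol:
        raise ValueError(
            f"data resolution is {actual}deg but {expected_deg}deg was expected; "
            "a finer grid may exhaust device memory"
        )
    return actual


# Synthetic latitude axes standing in for the coordinates of a loaded dataset.
lats_1deg = [-90.0 + i for i in range(181)]
lats_025deg = [-90.0 + 0.25 * i for i in range(721)]

check_resolution(lats_1deg, expected_deg=1.0)  # passes silently
try:
    check_resolution(lats_025deg, expected_deg=1.0)
except ValueError as err:
    print(f"caught: {err}")
```

With a real dataset, the latitude values would come from the dataset's own coordinate axis rather than the synthetic lists used here.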

[Attached screenshots, 2024-12-11 at 10:23:03 AM and 10:25:11 AM]

Great! Will make some changes to the notebook soon to clarify this.

Andrew