google-deepmind/graphcast

Issues with GraphCast Training – Request for Assistance

zlminmin opened this issue · 3 comments

Dear Professor,

I hope this message finds you well.

I am a postgraduate student and a beginner in deep learning. I have read your article on GraphCast and would like to retrain a GraphCast model based on your source code. I have provided a portion of my code below:

  1. Setting `model_config` and `task_config`, along with `train_steps` and `eval_steps` (a sketch of this step follows the list). [screenshot no longer renders]
  2. The main training loop, modified from the source code provided with the article (I am unsure whether it is correct), training for 10 epochs. [screenshots no longer render]
  3. The loss and gradients for the 10 training iterations. [screenshot no longer renders]
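In outline, the configuration step looks along these lines (a sketch only; the exact values are illustrative, e.g. a low-resolution 13-level setup, not the ones from the lost screenshot):

```python
from graphcast import graphcast

# Illustrative low-resolution settings (e.g. a GraphCast_small-like
# model); the real values depend on the model being retrained.
model_config = graphcast.ModelConfig(
    resolution=1.0,
    mesh_size=5,
    latent_size=512,
    gnn_msg_steps=16,
    hidden_layers=1,
    radius_query_fraction_edge_length=0.6)

# Predefined 13-pressure-level task configuration shipped with the repo.
task_config = graphcast.TASK_13

train_steps = 1   # autoregressive steps in the training targets
eval_steps = 10   # autoregressive steps in the evaluation rollout
```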

I have encountered several issues during training and would like to seek your guidance:

  1. How are the steps of the dataset divided during training? If training is done one step at a time, the first round takes in three time steps, with two time steps used as inputs and one used as the target. For the second round of training, is the training data re-read each time? In other words, should the entire dataset be divided and loaded before training, or should it be read in batches?
  2. Regarding the training process implemented in my code, I suspect it is not actually learning: despite random initialization, the loss and gradients computed in the first round are already very low, and after 10 rounds of training they have not changed significantly. Could you please provide some guidance on this point?
  3. Concerning the use of the autoregressive.py file during training: I understand that it is used for single-step prediction, but I am unclear on how to use it in training. If you could provide a simple training example, it would be greatly appreciated.

I look forward to your response. Thank you.

Best regards!

Hello, I have some questions about model training. Have you tried training models at the different resolutions, GraphCast_small (13 levels, 1°) and GraphCast (37 levels, 0.25°)? How much time and memory does it take to train these two models?

I look forward to your response. Thank you.

Best regards!

@zhongmengyi I have replied in your separate issue #77

> I have encountered several issues during training and would like to seek your guidance:

@zlminmin unfortunately we cannot provide much support for training, as optimal training is somewhat tied to internal infrastructure, and the scope of this initial open-sourcing was mainly focused on supporting inference. But let me try to answer your questions.

  1. Exactly: for the first stage of training on 1 autoregressive step, we sample trajectories with 3 time steps and call `data_utils.extract_inputs_targets_forcings` on those. Then we increase that sequence length by one each time we want to increase the length of the training target.
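Concretely, this mirrors the demo notebook (here `example_batch` is assumed to be an `xarray.Dataset` holding a sampled trajectory with at least `2 + train_steps` time steps, and `task_config` is your `TaskConfig`):

```python
import dataclasses
from graphcast import data_utils

train_steps = 1  # 1 autoregressive step: 2 input steps + 1 target step

# Split the sampled trajectory into 2 input steps, `train_steps` target
# steps, and the matching forcings; lead times are multiples of the 6h
# model step.
train_inputs, train_targets, train_forcings = (
    data_utils.extract_inputs_targets_forcings(
        example_batch,
        target_lead_times=slice("6h", f"{train_steps * 6}h"),
        **dataclasses.asdict(task_config)))
```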

  2. Unfortunately I cannot comment on how fast the loss should go down; however, I have noticed that your train function does not actually return the new params to be used by the next iteration, so perhaps that is the issue. See the sketch below.
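In other words, the updated params have to be fed back into the next iteration. A minimal plain-SGD sketch (here `grads_fn` stands for a jitted loss-and-gradients function as in the demo notebook, and `num_train_steps` and the learning rate are placeholders):

```python
import jax

LEARNING_RATE = 1e-3  # placeholder value

for step in range(num_train_steps):
    loss, diagnostics, next_state, grads = grads_fn(
        params, state, train_inputs, train_targets, train_forcings)
    # The crucial part: reassign `params` (and `state`) so the next
    # iteration trains the updated weights, not the initial ones.
    params = jax.tree_util.tree_map(
        lambda p, g: p - LEARNING_RATE * g, params, grads)
    state = next_state
```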

  3. During training you simply feed in targets and forcings that contain more than one future step, and call the loss method.
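A sketch of the wrapping pattern, following the demo notebook (the `normalization.InputsAndResiduals` wrapper is omitted for brevity; a real training setup would include it):

```python
from graphcast import autoregressive, casting, graphcast

def construct_wrapped_graphcast(model_config, task_config):
  # One-step GraphCast, cast to bfloat16, then wrapped so that loss and
  # rollout are computed autoregressively over however many target
  # steps are present in `targets`.
  predictor = graphcast.GraphCast(model_config, task_config)
  predictor = casting.Bfloat16Cast(predictor)
  return autoregressive.Predictor(predictor, gradient_checkpointing=True)

# Inside a hk.transform_with_state-wrapped function:
#   predictor = construct_wrapped_graphcast(model_config, task_config)
#   loss, diagnostics = predictor.loss(inputs, targets, forcings)
# With multi-step `targets` and `forcings`, the loss is averaged over an
# autoregressive rollout of that length.
```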