uzh-rpg/RVT

Resume training error

Closed this issue · 12 comments

Qiuben commented

Hi, magehrig

Sorry to bother you, but I have run into some problems when resuming training.

I trained the model with three GPUs, and I set the wandb config as:

wandb_runpath: zhang20010218/RVT/ja1260m8 # WandB run path. E.g. USERNAME/PROJECTNAME/1grv5kg6
artifact_name: zhang20010218/RVT/checkpoint-ja1260m8-topK:v1 # Name of checkpoint/artifact. Required for resuming. E.g. USERNAME/PROJECTNAME/checkpoint-1grv5kg6-last:v15
artifact_local_file: RVT/ja1260m8/checkpoints/last_epoch=000-step=100000.ckpt # If specified, will use the provided local filepath instead of downloading it. Required if resuming with DDP.
resume_only_weights: False
group_name: version1.0 # Specify group name of the run
project_name: RVT

Other than that, I didn't make any changes.

However, I got the following error:
File "/home/zht/Python_project/RVT/loggers/wandb_logger.py", line 218, in after_save_checkpoint
self._scan_and_log_checkpoints(checkpoint_callback, self._save_last and not self._save_last_only_final)
File "/home/zht/Python_project/RVT/loggers/wandb_logger.py", line 321, in _scan_and_log_checkpoints
self._rm_but_top_k(checkpoint_callback.save_top_k)
File "/home/zht/Python_project/RVT/loggers/wandb_logger.py", line 343, in _rm_but_top_k
score = artifact.metadata['score']
KeyError: 'score'

Do you know what's going on? Is there something wrong with my config settings?

The config you have seems to be correct for resuming the run. I have never encountered this error. It appears that the score (which is the validation score) is missing in the metadata. Can you check (e.g. on wandb) if the logged artifact has the "score" in the metadata?
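For reference, a minimal sketch of how you could inspect the artifact metadata with the public wandb API (using the artifact name from your config; adjust as needed) looks roughly like this:

import wandb

# Minimal sketch using the public wandb API; the artifact name is taken
# from the config posted above and may need adjusting.
api = wandb.Api()
artifact = api.artifact("zhang20010218/RVT/checkpoint-ja1260m8-topK:v1")

print(artifact.metadata)               # full metadata dict of the artifact
print(artifact.metadata.get("score"))  # None if the validation score is missing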

I am starting a run myself with this code to see if I can reproduce the issue. Can you confirm that you followed the installation instructions? For that purpose please post the output of conda list here.

Btw, I think it's also a bit strange that you have ../checkpoint-ja1260m8-topK:v1 as artifact_name but ../last_epoch=000-step=100000.ckpt as artifact_local_file. Are you validating every 50k steps?

I could replicate the issue and will look into it in the next few days.

Qiuben commented

Thank you for your response!

I have used the same Python environment as you (installed exactly as instructed).

Also, it is true that I evaluate every 50k steps. Could this setting have caused my error? If I evaluate every 10k steps instead, will the error no longer occur?

You mentioned that you have reproduced the error, so I will not provide the output of conda list or check for the "score" in wandb for now (since I am a beginner with wandb and am not sure how to use it yet). If you need me to describe any of my settings or environment, please let me know.

Can you try again? It works for me now.

So there is a combination of weird stuff happening.

  1. PyTorch Lightning executes the model checkpoint callback, which attempts to save the latest checkpoint even though we are actually resuming. It may be related to this PL issue: Lightning-AI/pytorch-lightning#12724. However, by itself this has not been a problem for me so far.
  2. I wrote a custom wandb logger which periodically uses the wandb (cloud) API to check how many artifacts are present and deletes old ones. This is a bit brittle because it relies on the wandb service to actually be reachable and working bug-free all the time.

I suspect there was a wandb server side issue that they may have fixed now.

If my suspicion is correct, a quick workaround to prevent future issues like this is to use the default PL wandb logger, or any other logger that saves checkpoints locally and does not rely on the wandb cloud API.
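As a rough sketch (not the RVT code, just an illustration of the standard PL/wandb pieces; the monitored metric name is a placeholder), such a setup could look like this:

import pytorch_lightning as pl
from pytorch_lightning.loggers import WandbLogger
from pytorch_lightning.callbacks import ModelCheckpoint

# log_model=False keeps checkpoints out of the wandb artifact store,
# so checkpoint management does not depend on the wandb cloud API.
logger = WandbLogger(project="RVT", group="version1.0", log_model=False)

checkpoint_cb = ModelCheckpoint(
    dirpath="checkpoints/",   # checkpoints stay on local disk
    monitor="val/AP",         # placeholder metric name
    mode="max",
    save_top_k=1,
    save_last=True,
)

trainer = pl.Trainer(logger=logger, callbacks=[checkpoint_cb])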

Qiuben commented

Are you saying that the error was caused by an issue on the wandb cloud server itself, and that it now works for you even though you didn't modify the code? I just tried it again and encountered the same error. I don't know whether the wandb cloud is having problems again, or whether it is now failing for you as well. This is really a strange issue; maybe some luck is involved at runtime.
The same error as before:

File "/home/zht/Python_project/RVT/loggers/wandb_logger.py", line 321, in _scan_and_log_checkpoints
self._rm_but_top_k(checkpoint_callback.save_top_k)
File "/home/zht/Python_project/RVT/loggers/wandb_logger.py", line 343, in _rm_but_top_k
score = artifact.metadata['score']
KeyError: 'score'

Qiuben commented

I only tried to resume training again. Do I need to train from scratch before trying to resume, since the previous issue was that wandb did not receive the data I uploaded during the earlier training?

Yes, only resume again. Unfortunately, I cannot reproduce the issue anymore, which makes it hard to debug. I created a branch that should fix the issue (in a hacky way): https://github.com/uzh-rpg/RVT/tree/resume-issue
Let me know if that works for you.
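The branch itself is not quoted here, but the rough idea of such a defensive fix is to tolerate artifacts whose metadata has no score instead of crashing; a sketch (not the actual code on the branch):

def _rm_but_top_k_safe(artifacts, top_k):
    # Keep only the top_k artifacts by validation score; artifacts without a
    # 'score' entry in their metadata (the KeyError above) are skipped.
    scored = []
    for artifact in artifacts:
        score = artifact.metadata.get("score")
        if score is None:
            continue  # missing score -> leave this artifact untouched
        scored.append((score, artifact))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    for _, artifact in scored[top_k:]:
        artifact.delete(delete_aliases=True)  # remove artifacts outside the top k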

Qiuben commented

Thank you for your prompt response!

I will try as soon as possible and give you feedback on the results.

Qiuben commented

Great! I think you have solved the problem, and now I can continue training. However, at the beginning of the resumed training, the displayed speed seems a bit strange, like this:

Epoch 0: : 103073it [05:24, 317.98it/s, loss=2.84, v_num=60m8]

The displayed speed does not match the actual speed, although the displayed value is slowly decreasing.
However, now I can resume the training normally, so I would like to close the issue!

Qiuben commented

And thank you again for your patient and helpful guidance. You have been a great help to me!

Yeah, that is a known issue, but it does not impact training. You can ignore it. Glad that it solved your problem :)