uzh-rpg/RVT

wandb resume

Closed this issue · 7 comments

Hi @magehrig, sorry to bother you again. While I was running the train.py script, my lab suddenly lost power, so I need to resume the training run.
Unfortunately, I'm not experienced with wandb. I tried to resume the run, but I ran into some problems that I could not solve on my own. 😭😭😭

So I would like to ask about the meaning of these parameters in the config file:
wandb_runpath, artifact_name, artifact_local_file

How should I set the values? Here are my checkpoint file and wandb file:
image
image

I can set ckpt_path directly in the train.py script, and that works, but I also want the resumed training results to show up in wandb!

I know it's probably an easy problem, but I can't solve it. I would appreciate any guidance!

Looking forward to your reply!
Best!

Hi @Chrazqee

The general description is in the config itself.

Assuming you want to restart your training (without DDP), you would set it as follows:

wandb_runpath=USERNAME/RNN-ST-2/ID (see: online wandb interface -> Overview -> Run path)
artifact_name=USERNAME/RNN-ST-2/checkpoint-ID-last:vX (see: online wandb interface -> Overview -> Artifact Outputs -> Full name)

If you are running the model with DDP you would additionally set artifact_local_file to the local path where you downloaded/saved the checkpoint.
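The naming pattern described above can be sketched in a few lines of Python. This is only an illustration, not part of the repository: the function name `resume_settings` and its `version` argument are hypothetical, and the sketch assumes the artifact lives in the same wandb project as the run. The authoritative values are still the ones shown in the wandb web interface.

```python
from typing import Dict, Optional


def resume_settings(run_path: str, version: int,
                    local_file: Optional[str] = None) -> Dict[str, str]:
    """Build the resume-related config values from a wandb run path.

    run_path: "USERNAME/PROJECT/ID", as shown under Overview -> Run path.
    version:  the X in ":vX", read from Artifact Outputs -> Full name.
    """
    parts = run_path.strip("/").split("/")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected USERNAME/PROJECT/ID, got {run_path!r}")
    user, project, run_id = parts

    settings = {
        "wandb_runpath": f"{user}/{project}/{run_id}",
        # Pattern copied from the reply above. Compare it with the
        # "Full name" in the web interface before relying on it.
        "artifact_name": f"{user}/{project}/checkpoint-{run_id}-last:v{version}",
    }
    if local_file is not None:
        # Per the reply above, only needed when training with DDP.
        settings["artifact_local_file"] = local_file
    return settings


if __name__ == "__main__":
    # Placeholder values; substitute your own run path and version.
    for key, value in resume_settings("USERNAME/RNN-ST-2/ID", 3).items():
        print(f"{key}={value}")
```

Printing the pairs as `key=value` makes it easy to compare them with what you put in the config file.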

Let me know if that helps. If not, please show the command that you used and the error message.

Hi @magehrig, thank you so much. I'm really not good at using wandb. Your guidance was so clear that I could solve the problem immediately. I'm sorry for bothering you, since I could have found the solution in the config file.

Here is my result:
image

There may still be some minor issues, but I'll solve them myself!

Thank you so much again!

Best!

Hello Chrazqee and magehrig, I can't seem to find the checkpoints on the artifacts page of my wandb project on the website. I do have the checkpoint in my local repo. I only see jobs and history on the artifacts page.

Hi @ostromb, I followed magehrig's recommendations and that solved my issue. I think you should check the website again.

These are the artifacts that I can see on the webpage, which are history and jobs: https://i.imgur.com/4okLveW.png . Where can I find the checkpoints?

You should set wandb_runpath=xxx and artifact_name=xxx to the values from your own wandb project in the config > general.yaml file. For example, I need to set artifact_name=chan_0613/RVT/checkpoint-pbzw7j79-last:v7. You can try it by following the author's guidance in the config > general.yaml file. Hope you can solve it!
Here is a screenshot of my wandb web page.
image
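As a quick sanity check before resuming, an artifact name of this shape can be taken apart to confirm that it points at the run you expect. The sketch below is hypothetical helper code, not part of the repository. It assumes the `USERNAME/PROJECT/checkpoint-ID-last:vX` layout seen in this thread and a run ID without hyphens, which holds for the IDs shown here but may not hold for custom run IDs.

```python
import re
from typing import NamedTuple


class CheckpointArtifact(NamedTuple):
    user: str
    project: str
    run_id: str
    alias: str    # e.g. "last"
    version: int  # the X in ":vX"


# Assumed layout: USERNAME/PROJECT/checkpoint-ID-ALIAS:vX
_PATTERN = re.compile(
    r"^(?P<user>[^/]+)/(?P<project>[^/]+)/"
    r"checkpoint-(?P<run_id>[^-/:]+)-(?P<alias>[^:/]+):v(?P<version>\d+)$"
)


def parse_artifact_name(name: str) -> CheckpointArtifact:
    """Split a full artifact name into its parts, or raise ValueError."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected artifact name: {name!r}")
    return CheckpointArtifact(
        user=match["user"],
        project=match["project"],
        run_id=match["run_id"],
        alias=match["alias"],
        version=int(match["version"]),
    )


def run_path_of(artifact: CheckpointArtifact) -> str:
    """Run path implied by the artifact name, under the assumed layout."""
    return f"{artifact.user}/{artifact.project}/{artifact.run_id}"


if __name__ == "__main__":
    parsed = parse_artifact_name("chan_0613/RVT/checkpoint-pbzw7j79-last:v7")
    print(parsed)
    print(run_path_of(parsed))
```

If the printed run path differs from the value you set for wandb_runpath, the two settings probably refer to different runs.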

Did you make any changes to the repo code/configs to produce the model artifacts? It seems my runs are not producing any model artifacts at all…