wandb resume
Closed this issue · 7 comments
Hi @magehrig, I'm sorry to bother you again. While I was running the train.py script, my lab suddenly lost power, so I have to resume the run. Unfortunately, I'm not experienced with wandb. I tried to resume the run, but I ran into problems that I could not solve. 😭😭😭
So I want to ask about the meaning of some parameters in the config file: wandb_runpath, artifact_name, and artifact_local_file. How should I set these values? Here are my checkpoint file and wandb files:
I can set ckpt_path directly in the train.py script, and that works, but I also want the resumed results to show up in wandb!
I know it's probably an easy problem, but I cannot solve it. I would appreciate any guidance!
Looking forward to your reply!
Best!
Hi @Chrazqee
The general description is in the config itself.
Assuming you want to restart your training (without DDP), you would set it as follows:
wandb_runpath=USERNAME/RNN-ST-2/ID (see: online wandb interface -> Overview -> Run path)
artifact_name=USERNAME/RNN-ST-2/checkpoint-ID-last:vX (see: online wandb interface -> Overview -> Artifact Outputs -> Full name)
If you are running the model with DDP, you would additionally set artifact_local_file to the local path where you downloaded/saved the checkpoint.
Let me know if that helps. If not, please show the command that you used and the error message.
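As a rough sanity check before resuming, the two values above can be compared against each other. The sketch below is not part of the repository: the helper names check_resume_settings and download_checkpoint are hypothetical, and it assumes that checkpoint artifacts follow the checkpoint-ID-last:vX naming shown above. The example run path is inferred from the artifact name quoted in this thread. The download step uses the public wandb API and requires being logged in.

```python
import re

# entity/project/run_id, as shown under Overview -> Run path
RUN_PATH_RE = re.compile(r"^(?P<entity>[^/]+)/(?P<project>[^/]+)/(?P<run_id>[^/]+)$")
# entity/project/name:alias, as shown under Artifact Outputs -> Full name
ARTIFACT_RE = re.compile(
    r"^(?P<entity>[^/]+)/(?P<project>[^/]+)/(?P<name>[^/:]+):(?P<alias>[^/:]+)$"
)


def check_resume_settings(wandb_runpath, artifact_name):
    """Hypothetical helper: verify that the run path and artifact name agree."""
    run = RUN_PATH_RE.match(wandb_runpath)
    if run is None:
        raise ValueError(f"expected ENTITY/PROJECT/ID, got {wandb_runpath!r}")
    art = ARTIFACT_RE.match(artifact_name)
    if art is None:
        raise ValueError(f"expected ENTITY/PROJECT/NAME:ALIAS, got {artifact_name!r}")
    for key in ("entity", "project"):
        if run[key] != art[key]:
            raise ValueError(f"{key} differs: {run[key]!r} vs {art[key]!r}")
    # Assumption: checkpoint artifacts are named checkpoint-<run id>-...
    if not art["name"].startswith(f"checkpoint-{run['run_id']}-"):
        raise ValueError("artifact does not seem to belong to this run")
    return {
        "entity": run["entity"],
        "project": run["project"],
        "run_id": run["run_id"],
        "alias": art["alias"],
    }


def download_checkpoint(artifact_name, root=None):
    """Download the artifact and return the local directory (needs wandb login)."""
    import wandb  # imported lazily so the check above works without wandb

    artifact = wandb.Api().artifact(artifact_name)
    return artifact.download(root=root)


if __name__ == "__main__":
    info = check_resume_settings(
        "chan_0613/RVT/pbzw7j79",
        "chan_0613/RVT/checkpoint-pbzw7j79-last:v7",
    )
    print(info)
```

For the DDP case, the directory returned by download_checkpoint contains the checkpoint file, and its path is what would go into artifact_local_file.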
Hi @magehrig, thank you so much. I'm really not good at using wandb, and your guidance was so clear that I could solve the problem immediately. I'm sorry for bothering you, since I could have found the solution in the config file.
There may still be a few small issues, but I'll solve them myself!
Thank you so much again!
Best!
Hello Chrazqee and magehrig, I can't seem to find the checkpoints on the artifacts page of my wandb project on the website. I do have the checkpoint in my local repo, but the artifacts page only shows jobs and history.
Hi @ostromb, I followed magehrig's recommendations and that solved my issue. I think you should check the website again.
These are the artifacts that I can see on the webpage, which are history and jobs: https://i.imgur.com/4okLveW.png . Where can I find the checkpoints?
You should set wandb_runpath=xxx and artifact_name=xxx to the corresponding values from your own wandb project in the config>general.yaml file. For example, I need to set: artifact_name=chan_0613/RVT/checkpoint-pbzw7j79-last:v7. You can try it by following the author's guidance in the config>general.yaml file. Hope you can solve it!
Here is a screenshot of my wandb web page.
Did you make any changes to the repo code/configs to produce the model artifacts? It seems my runs are not producing any model artifacts at all…