Problems in running the forecasting network
Hi, I trained the interpolator network on the spring mesh dataset and killed the run after a few epochs. This saved the last.ckpt file locally in the results/checkpoints directory and gave me a run ID for it (which, weirdly, is not alphanumeric but an 8-digit integer). As a result, I can't run the command
python run.py experiment=spring_mesh_dyffusion diffusion.interpolator_run_id=<WANDB_RUN_ID>
because it keeps giving me an AssertionError: run_id must be a string, but is <class 'int'>: 48422040
So, as described in dyffusion.yaml, I set run_id to the ID of that run and the file name to "last.ckpt", and then ran python run.py experiment=spring_mesh_dyffusion
But I am now facing this error:
File "/home/vd2298/reimplementation/src/train.py", line 66, in run_model
ckpt_path2 = wandb.restore(ckpt_filename, run_path=wandb.run.path, replace=True, root=os.getcwd()).name
File "/scratch/vd2298/envs/dyffusion/lib/python3.12/site-packages/wandb/sdk/wandb_run.py", line 4225, in restore
raise ValueError(f"File {name} not found in {run_path or root}.")
ValueError: File last.ckpt not found in vd2298-new-york-university/DYffusion-spring-mesh/48422040.
This happens even though my checkpoint is saved locally and the file last.ckpt also shows up on my wandb run.
So I decided to use the other option of specifying the local path in dyffusion.yaml, but that doesn't work either: it keeps going back to an older run_id and doesn't start the forecasting network at all. Can you please suggest what I should try next, or point out what I am doing wrong?
Hi!
Regarding your first problem: if the ID is a number, can you just make it a string and see if that fixes it? E.g., pass diffusion.interpolator_run_id="<WANDB_RUN_ID>". Let me know if it doesn't.
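One thing to watch out for with the quotes: a POSIX shell removes plain double quotes before the program sees the argument, so the value can still arrive as a bare number. The sketch below illustrates this with the standard-library shlex module, which follows POSIX shell quoting rules. Note that parsed_type is a hypothetical stand-in for an override parser that treats unquoted all-digit values as integers (consistent with the AssertionError above); it is not the actual parser used by run.py.

```python
import shlex


def value_seen_by_program(command: str, key: str) -> str:
    """Return the raw text of `key`'s value after shell-style quote removal."""
    for token in shlex.split(command):
        if token.startswith(key + "="):
            return token.split("=", 1)[1]
    raise KeyError(key)


def parsed_type(raw: str) -> type:
    """Hypothetical stand-in for an override parser (an assumption):
    a value still wrapped in quotes is a string, a bare all-digit value is an int."""
    if len(raw) >= 2 and raw[0] == raw[-1] and raw[0] in "'\"":
        return str
    return int if raw.isdigit() else str


KEY = "diffusion.interpolator_run_id"

# 1) No quotes: the program receives a bare number.
plain = value_seen_by_program(f"python run.py {KEY}=48422040", KEY)

# 2) Double quotes only: the shell strips them, so nothing changes.
shell_quoted = value_seen_by_program(f'python run.py {KEY}="48422040"', KEY)

# 3) Single quotes around the whole argument: the inner double quotes survive.
protected = value_seen_by_program(f"python run.py '{KEY}=\"48422040\"'", KEY)

for label, raw in [("plain", plain), ("shell_quoted", shell_quoted), ("protected", protected)]:
    print(f"{label:13s} raw={raw!r:14s} parsed as {parsed_type(raw).__name__}")
```

If the plain double-quoted form still triggers the assertion, wrapping the whole override in single quotes, as in the third case, may be worth trying.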
Regarding your second problem, that sounds weird, since it does look like the correct file is saved on wandb. If it's still a problem, can you email me so that we can look into it together? Otherwise, it would help if you could provide the exact command that you ran and a public wandb link to the problematic run.
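In the meantime, it may help to confirm exactly which local checkpoint files exist, so that the local-path option can point at an absolute path. This is only a minimal sketch using the standard library: find_local_checkpoints is a hypothetical helper (not part of the repository), and results/checkpoints is the directory mentioned in the question.

```python
from pathlib import Path


def find_local_checkpoints(root, filename="last.ckpt"):
    """Return absolute paths of files named `filename` under `root`, newest first."""
    root = Path(root)
    if not root.is_dir():
        return []
    matches = [p.resolve() for p in root.rglob(filename) if p.is_file()]
    # Sort by modification time so the most recently written checkpoint comes first.
    return sorted(matches, key=lambda p: p.stat().st_mtime, reverse=True)


# Prints nothing if the directory does not exist or holds no matching file.
for path in find_local_checkpoints("results/checkpoints"):
    print(path)
```

If more than one last.ckpt shows up, an older run's file being picked up could be one explanation for the run falling back to an older run_id, though that is a guess without seeing the exact config.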
Thanks!