Problems in running the forecasting network

Question

Opened this issue 4 months ago · 1 comment

Hi, I trained the interpolator network on the spring mesh dataset and killed the run after a few epochs. This saved the last.ckpt file locally in the results/checkpoints directory and gave me a run ID for it (which, oddly, is not alphanumeric but an 8-digit integer). As a result, I can't run the command
python run.py experiment=spring_mesh_dyffusion diffusion.interpolator_run_id=<WANDB_RUN_ID>
because it keeps failing with AssertionError: run_id must be a string, but is <class 'int'>: 48422040

So, as described in dyffusion.yaml, I set run_id to the ID of my run and the file name to "last.ckpt", and then ran python run.py experiment=spring_mesh_dyffusion
But I am now getting this error:

File "/home/vd2298/reimplementation/src/train.py", line 66, in run_model
   ckpt_path2 = wandb.restore(ckpt_filename, run_path=wandb.run.path, replace=True, root=os.getcwd()).name
                
 File "/scratch/vd2298/envs/dyffusion/lib/python3.12/site-packages/wandb/sdk/wandb_run.py", line 4225, in restore
   raise ValueError(f"File {name} not found in {run_path or root}.")
ValueError: File last.ckpt not found in vd2298-new-york-university/DYffusion-spring-mesh/48422040.

It seems that my checkpoint is saved locally, and the file last.ckpt also shows up on my wandb run, yet the restore call above cannot find it.

So I tried the other option of specifying the local path in dyffusion.yaml, but that doesn't work either: it keeps going back to an older run_id and never starts the forecasting network. Can you suggest what I should try next, or point out what I am doing wrong?

Answer 1 · 2024-10-04T22:38:45.000Z

Hi!

Regarding your first problem: if the ID is a number, can you pass it as a string and see if that fixes it? E.g., use diffusion.interpolator_run_id="<WANDB_RUN_ID>". Let me know if it doesn't.
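
One caveat with quoting on the command line: a POSIX shell strips unescaped double quotes before the program ever sees them, so the value can still arrive looking like a bare integer. The sketch below uses Python's shlex module to show what the shell would hand to run.py for two ways of writing the override. It uses the run ID from the error message above, and it assumes run.py parses key=value overrides with Hydra, which the command syntax suggests but this thread does not confirm.

```python
import shlex

# shlex.split follows POSIX shell quoting rules, so it shows the
# argument list the shell would pass to run.py.

# Double quotes only: the shell removes them, leaving a bare number.
unquoted = shlex.split('python run.py diffusion.interpolator_run_id="48422040"')

# Single quotes around the whole override: the inner double quotes survive.
quoted = shlex.split("""python run.py 'diffusion.interpolator_run_id="48422040"'""")

print(unquoted[-1])  # diffusion.interpolator_run_id=48422040
print(quoted[-1])    # diffusion.interpolator_run_id="48422040"
```

Under the Hydra assumption, the second form is the one where the parser sees a quoted value and treats it as a string rather than an integer.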

Regarding your second problem, that sounds weird, since it does look like the correct file is saved on wandb. If it's still a problem, can you email me so that we can look into it together? Otherwise, it would help if you could provide the exact command that you ran and a public wandb link to the problematic run.
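
While narrowing this down, it may also help to confirm which checkpoint files actually exist on disk. The following is a minimal sketch, assuming checkpoints live somewhere under results/checkpoints as described in the question. find_checkpoints is a hypothetical helper written for illustration, not part of the repository.

```python
from pathlib import Path


def find_checkpoints(root: str, filename: str = "last.ckpt") -> list[Path]:
    """Return every file named `filename` under `root`, newest first."""
    matches = [p for p in Path(root).rglob(filename) if p.is_file()]
    # Sort by modification time so the most recent run comes first.
    return sorted(matches, key=lambda p: p.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    # Assumed location, taken from the question; adjust to your setup.
    for path in find_checkpoints("results/checkpoints"):
        print(path.resolve())
```

Including the printed paths along with the exact command would make it easier to see whether the run picked up an older checkpoint.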

Thanks!