mir-group/nequip

❓ [QUESTION] Restart run

IZugec opened this issue · 1 comment

Hello,

I have a very large dataset, so large that even with multiprocessing it takes one and a half to two days to preprocess. After an unexpected crash on the node, I would like to continue training starting from the best_model.pth weights. However, I would really like to avoid processing this huge dataset again.

I tried both initial_model_state / initialize_from_state and load_model_state / load_model_state

However, when I started the initial training run the append key was false, so now when I set it to false again the error is:

Traceback (most recent call last):
  File "/home/user/.conda/envs/nequip_stress/bin/nequip-train", line 8, in <module>
    sys.exit(main())
  File "/home/user/.conda/envs/nequip_stress/lib/python3.10/site-packages/nequip/scripts/train.py", line 65, in main
    raise RuntimeError(
RuntimeError: Training instance exists at /path_to_traning_dir; either set append to True or use a different root or runname

However, when I start it with append set to true, I get the following error:

Traceback (most recent call last):
  File "/home/user/.conda/envs/nequip_stress/bin/nequip-train", line 8, in <module>
    sys.exit(main())
  File "/home/user/.conda/envs/nequip_stress/lib/python3.10/site-packages/nequip/scripts/train.py", line 74, in main
    trainer = restart(config)
  File "/home/user/.conda/envs/nequip_stress/lib/python3.10/site-packages/nequip/scripts/train.py", line 220, in restart
    raise ValueError(
ValueError: Key "append" is different in config and the result trainer.pth file. Please double check

I guess the question is: is there a way to pass the already processed dataset along with the model state?

Thanks in advance for any advice,
Ivan

Hi @IZugec,

> I tried both initial_model_state / initialize_from_state and load_model_state / load_model_state

This will be the easiest way forward, and it will load the cached processed dataset unless something goes wrong. I think there should be a full discussion of how to do this in the issue linked below. You want initialize_from_state and a new run name:

#235
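
For readers who want to see the suggested change concretely, here is a minimal sketch of the config edits, written as a plain Python dictionary transformation. It is an illustration under assumptions, not nequip's own code. The names initial_model_state and initialize_from_state come from this thread. The keys run_name, root, and model_builders, the dataset key, the builder names, and all paths are assumed or hypothetical and should be checked against your own config file. The helper config_for_fresh_run is made up for this example.

```python
import copy


def config_for_fresh_run(old_config, new_run_name, weights_path):
    """Return a copy of old_config that starts a new run from saved weights.

    Hypothetical helper: it only edits a dictionary and does not call nequip.
    """
    cfg = copy.deepcopy(old_config)

    # A new run name avoids the "Training instance exists" error
    # without touching append. (Key name run_name is an assumption.)
    cfg["run_name"] = new_run_name

    # Point the new run at the saved weights (key name taken from the thread).
    cfg["initial_model_state"] = weights_path

    # Add the initializing builder once, after the existing builders.
    # (Key name model_builders is an assumption.)
    builders = list(cfg.get("model_builders", []))
    if "initialize_from_state" not in builders:
        builders.append("initialize_from_state")
    cfg["model_builders"] = builders

    # Dataset-related keys are deliberately left unchanged, on the assumption
    # that the cached processed dataset is only reused when they match.
    return cfg


# Illustrative values only; none of these come from the original post.
old = {
    "root": "results/my_project",
    "run_name": "run1",
    "append": False,
    "dataset_file_name": "data.xyz",
    "model_builders": ["EnergyModel", "ForceOutput"],
}
new = config_for_fresh_run(
    old, "run1_restart", "results/my_project/run1/best_model.pth"
)
```

The same three changes can be made by hand in the YAML config. Note that this starts a fresh run from the saved weights and leaves append untouched, so it presumably does not restore the optimizer or trainer state from trainer.pth.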