QData/spacetimeformer

Error while resuming training from saved checkpoint

Passing ckpt_path to PyTorch Lightning's Trainer.fit() method raises the error below. The failing line is trainer.fit(forecaster, datamodule=data_module, ckpt_path='best.ckpt.ckpt'). The intent is to resume training from a saved checkpoint.

Restoring states from the checkpoint path at best.ckpt.ckpt

==================================================================
| Name | Type | Params

0 | spacetimeformer | Spacetimeformer | 4.5 M

4.5 M Trainable params
0 Non-trainable params
4.5 M Total params
18.191 Total estimated model params size (MB)
Restored all states from the checkpoint file at best.ckpt.ckpt
Epoch 0: 75%|██████████████████████████████▌ | 105/140 [00:00<?, ?it/s]
Traceback (most recent call last):
File "train_vol.py", line 457, in <module>
trainer.fit(forecaster, datamodule=data_module, ckpt_path='best.ckpt.ckpt')
File "/home/deepak.l/venv_spacetimeformer_13_sep/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 771, in fit
self._call_and_handle_interrupt(
File "/home/deepak.l/venv_spacetimeformer_13_sep/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 722, in _call_and_handle_interrupt
return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
File "/home/deepak.l/venv_spacetimeformer_13_sep/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
return function(*args, **kwargs)
File "/home/deepak.l/venv_spacetimeformer_13_sep/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 812, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/deepak.l/venv_spacetimeformer_13_sep/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1237, in _run
results = self._run_stage()
File "/home/deepak.l/venv_spacetimeformer_13_sep/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1324, in _run_stage
return self._run_train()
File "/home/deepak.l/venv_spacetimeformer_13_sep/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1354, in _run_train
self.fit_loop.run()
File "/home/deepak.l/venv_spacetimeformer_13_sep/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 205, in run
self.on_advance_end()
File "/home/deepak.l/venv_spacetimeformer_13_sep/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 297, in on_advance_end
self.trainer._call_callback_hooks("on_train_epoch_end")
File "/home/deepak.l/venv_spacetimeformer_13_sep/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1637, in _call_callback_hooks
fn(self, self.lightning_module, *args, **kwargs)
File "/home/deepak.l/venv_spacetimeformer_13_sep/lib/python3.8/site-packages/pytorch_lightning/callbacks/early_stopping.py", line 179, in on_train_epoch_end
self._run_early_stopping_check(trainer)
File "/home/deepak.l/venv_spacetimeformer_13_sep/lib/python3.8/site-packages/pytorch_lightning/callbacks/early_stopping.py", line 190, in _run_early_stopping_check
if trainer.fast_dev_run or not self._validate_condition_metric( # disable early_stopping with fast_dev_run
File "/home/deepak.l/venv_spacetimeformer_13_sep/lib/python3.8/site-packages/pytorch_lightning/callbacks/early_stopping.py", line 145, in _validate_condition_metric
raise RuntimeError(error_msg)
RuntimeError: Early stopping conditioned on metric val/loss which is not available. Pass in or modify your EarlyStopping callback to use any of the following: ``
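
The final frames of the traceback suggest that the early-stopping check ran at the end of the resumed training epoch, before any validation metric had been logged again, which would also explain why the list of available metrics in the message is empty. The sketch below is a simplified, hypothetical stand-in for that check, not Lightning's actual code: the function name `validate_condition_metric`, the dictionaries, and the value 0.42 are made up for illustration only.

```python
def validate_condition_metric(callback_metrics, monitor, strict=True):
    """Simplified stand-in for the metric check that fails in the traceback.

    Returns True when `monitor` has been logged. When it is missing, raises
    RuntimeError if `strict` is True, otherwise returns False.
    """
    if monitor in callback_metrics:
        return True
    available = "`, `".join(callback_metrics)
    message = (
        f"Early stopping conditioned on metric `{monitor}` which is not available. "
        f"Available metrics: `{available}`"
    )
    if strict:
        raise RuntimeError(message)
    return False


# Assumed state right after resuming: no validation pass has run yet,
# so nothing has been logged under "val/loss".
logged_after_resume = {}
try:
    validate_condition_metric(logged_after_resume, "val/loss")
    resume_error = None
except RuntimeError as err:
    resume_error = str(err)

# Once a validation pass has logged the metric (illustrative value), the check passes.
logged_after_validation = {"val/loss": 0.42}
check_passes = validate_condition_metric(logged_after_validation, "val/loss")
```

If this reading is right, two options on the EarlyStopping callback may be worth trying: check_on_train_epoch_end=False, so that the check runs after validation instead of at the end of the training epoch, or strict=False, so that a missing metric no longer raises. Neither has been verified against this repository's training script.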