plemeri/InSPyReNet

Error when resume training

yaju1234 opened this issue · 1 comment

I trained the model for 60 epochs on my own 62k human dataset, which is similar to DUTS. The results were not satisfactory, so I decided to resume training with more epochs, but I got the following error when resuming:

Traceback (most recent call last):
  File "run/Train.py", line 178, in <module>
    train(opt, args)
  File "run/Train.py", line 140, in train
    optimizer.step()
  File "/root/inspyrenet/venv/lib/python3.6/site-packages/torch/optim/lr_scheduler.py", line 65, in wrapper
    return wrapped(*args, **kwargs)
  File "/root/inspyrenet/venv/lib/python3.6/site-packages/torch/optim/optimizer.py", line 88, in wrapper
    return func(*args, **kwargs)
  File "/root/inspyrenet/venv/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/root/inspyrenet/venv/lib/python3.6/site-packages/torch/optim/adam.py", line 144, in step
    eps=group['eps'])
  File "/root/inspyrenet/venv/lib/python3.6/site-packages/torch/optim/functional.py", line 98, in adam
    param.addcdiv_(exp_avg, denom, value=-step_size)
RuntimeError: value cannot be converted to type float without overflow: (-9.99425e-08,-1.86689e-11)


Hi, we implemented the resume feature for unexpected shutdowns and other accidents, so all of the loaded state, including the learning rate decay, optimizer, and scheduler, is coupled to the original 60-epoch setting.
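The two-component value in the RuntimeError looks like a complex number, which is consistent with a polynomial learning-rate decay stepping past its original epoch budget. Assuming a schedule of the hypothetical form lr = base_lr * (1 - epoch / max_epoch) ** gamma (the repo's exact schedule may differ), a minimal sketch of the failure:

```python
# Hypothetical poly-decay schedule: lr = base_lr * (1 - epoch / max_epoch) ** gamma
# (illustrative values; not taken from the repo's config)
base_lr, gamma, max_epoch = 1e-5, 0.9, 60

# Within the original budget, the decay factor is a real number in [0, 1].
lr_ok = base_lr * (1 - 59 / max_epoch) ** gamma
print(type(lr_ok).__name__)   # float

# Past max_epoch the base of the power turns negative, and Python evaluates
# a fractional power of a negative number as a complex value. Passing that
# into the optimizer's step size triggers the float-conversion error above.
lr_bad = base_lr * (1 - 61 / max_epoch) ** gamma
print(type(lr_bad).__name__)  # complex
```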

I would recommend training from the beginning with a 200-epoch setting.
Thanks
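For anyone who cannot retrain from scratch, one common workaround (a sketch, not a supported path in this repo) is to clamp the decay progress so the fractional power never sees a negative base, even when the resumed run exceeds the original epoch budget:

```python
def poly_lr(base_lr, epoch, max_epoch, gamma=0.9):
    """Hypothetical poly-decay helper, clamped for safe resumption."""
    # Clamp progress to [0, 1] so the base of the fractional power
    # can never go negative past the original epoch budget.
    progress = min(max(epoch / max_epoch, 0.0), 1.0)
    return base_lr * (1.0 - progress) ** gamma

print(poly_lr(1e-5, 61, 60))  # 0.0 (a real float), not a complex number
```

Note that this merely keeps the learning rate real (it flattens to zero past the budget); to actually train for more epochs, the scheduler would need to be re-created with the new total, which is why retraining with the larger epoch setting from the start is the cleaner option.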