Xiaoming-Yu/DMIT

Bug with --continue_train and its solution

humberthumbert opened this issue · 5 comments

Hey there, there is a bug when one wants to continue training a model. Simply adding the --continue_train flag on the command line does not work.

To continue training, one has to

  1. Go to /models/base_model.py and modify lines 64-65: remove the quotation marks around net_name.
  2. Add the following code snippet immediately after line 65:
          # Move the loaded optimizer state tensors onto the GPU.
          for state in net_optimizer.state.values():
              for k, v in state.items():
                  if isinstance(v, torch.Tensor):
                      state[k] = v.cuda()

The second step is my temporary workaround. Without it, one might encounter an error like: RuntimeError: expected device cpu but got device cuda:0. Although it resolves the error, I think there should be a more elegant way to do this.
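For anyone who wants to see the whole picture, here is a minimal, self-contained sketch of where the device mismatch comes from and what the step-2 workaround does. The toy nn.Linear network, the ckpt.pth file name, and the dummy training step are placeholders for illustration, not DMIT code.

    import torch
    import torch.nn as nn

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    # Stand-in for a DMIT sub-network and its optimizer.
    net = nn.Linear(8, 8).to(device)
    net_optimizer = torch.optim.Adam(net.parameters(), lr=2e-4)

    # One dummy update so the optimizer actually has state (Adam moment buffers).
    net(torch.randn(4, 8, device=device)).sum().backward()
    net_optimizer.step()
    torch.save({'net': net.state_dict(), 'optim': net_optimizer.state_dict()}, 'ckpt.pth')

    # Resuming, mirroring the original ordering that caused the bug:
    net = nn.Linear(8, 8)                                # network still on the CPU here
    net_optimizer = torch.optim.Adam(net.parameters(), lr=2e-4)
    checkpoint = torch.load('ckpt.pth', map_location='cpu')
    net.load_state_dict(checkpoint['net'])
    net_optimizer.load_state_dict(checkpoint['optim'])   # Adam buffers stay on the CPU
    net.to(device)                                       # parameters move to the GPU afterwards

    # Without the loop below, the next net_optimizer.step() raises
    # "RuntimeError: expected device cpu but got device cuda:0".
    for state in net_optimizer.state.values():
        for k, v in state.items():
            if isinstance(v, torch.Tensor):
                state[k] = v.to(device)   # equivalent to the .cuda() call in step 2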

Hey there.
Thanks for helping me solve the problem of continuing to train a model.
I still have a small issue: with the default settings, the model is only trained for 200 epochs. If I want to train for more epochs, what should I do?

@loseway12138
In options/train_options.py, the available flags are described:
--niter: # of iter at starting learning rate
--niter_decay: # of iter to linearly decay learning rate to zero
Both of them default to 100, so training runs for 200 epochs in total.
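For context, these two flags usually interact through the linear-decay rule used by many image-to-image translation codebases: the learning rate stays constant for niter epochs and then decays linearly to zero over the next niter_decay epochs. The sketch below reproduces that rule with a LambdaLR scheduler; it is an assumption about the scheduling logic based on the flag descriptions above, not a copy of DMIT's scheduler.

    import torch

    niter, niter_decay = 100, 100          # the defaults described above

    def lr_lambda(epoch):
        # Multiplier of 1.0 for the first `niter` epochs, then a linear ramp to ~0
        # over the following `niter_decay` epochs.
        return 1.0 - max(0, epoch + 1 - niter) / float(niter_decay + 1)

    net = torch.nn.Linear(8, 8)
    optimizer = torch.optim.Adam(net.parameters(), lr=2e-4)
    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_lambda)

    for epoch in range(niter + niter_decay):   # 200 epochs in total by default
        optimizer.step()                       # stands in for one epoch of training
        scheduler.step()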

I noticed this:
--niter: # of iter at starting learning rate
--niter_decay: # of iter to linearly decay learning rate to zero
Should I modify the default values of --niter and --niter_decay directly? For example, if I need to train for 400 epochs, should I set both of them to 200?

@humberthumbert Thank you for pointing out this bug. We have fixed it.
For the second issue, we find it helpful to put the model on the GPU (net.to(self.device)) before creating the optimizer (net_optimizer = self.define_optimizer(net)); see lines 62-63 in /models/base_model.py.
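To spell out why that ordering helps, here is a minimal stand-alone sketch. The define_optimizer helper, the toy nn.Linear network, and the optim.pth file name are placeholders rather than the actual DMIT source; the point is only the order of net.to(device) relative to creating and loading the optimizer.

    import torch
    import torch.nn as nn

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

    def define_optimizer(net):
        # Placeholder mirroring self.define_optimizer(net) in base_model.py.
        return torch.optim.Adam(net.parameters(), lr=2e-4)

    # First run: train one step and save the optimizer state.
    net = nn.Linear(8, 8).to(device)
    net_optimizer = define_optimizer(net)
    net(torch.randn(4, 8, device=device)).sum().backward()
    net_optimizer.step()
    torch.save(net_optimizer.state_dict(), 'optim.pth')

    # Resuming with the fixed ordering.
    net = nn.Linear(8, 8)
    net.to(device)                          # move to the GPU *before* creating the optimizer
    net_optimizer = define_optimizer(net)   # the optimizer now refers to CUDA parameters
    net_optimizer.load_state_dict(torch.load('optim.pth', map_location='cpu'))
    # load_state_dict casts the loaded state to the device of the matching parameters,
    # so the manual loop from the earlier workaround is no longer needed.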

@loseway12138 If you need to train for 400 epochs, niter and niter_decay can both be set to 200.
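As a concrete example (the train.py entry point and the omitted dataset-specific options are placeholders; only --niter, --niter_decay, and --continue_train are the flags discussed in this thread):

    python train.py --niter 200 --niter_decay 200 --continue_train [other dataset/model options]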