johannbrehmer/manifold-flow

NaN error

zaocan666 opened this issue · 3 comments

Hi, excellent work here.
I encountered a NaN error when training with the config configs/train_mf_gan64d_april.config:

Traceback (most recent call last):
  File "/home/urkax/project/GenFed/manifold-flow-public/experiments/train.py", line 592, in <module>
    learning_curves = train_model(args, dataset, model, simulator)
  File "/home/urkax/project/GenFed/manifold-flow-public/experiments/train.py", line 504, in train_model
    learning_curves = train_manifold_flow_sequential(args, dataset, model, simulator)
  File "/home/urkax/project/GenFed/manifold-flow-public/experiments/train.py", line 276, in train_manifold_flow_sequential
    learning_curves = trainer1.train(
  File "/home/urkax/project/GenFed/manifold-flow-public/experiments/training/trainer.py", line 307, in train
    loss_train, loss_val, loss_contributions_train, loss_contributions_val = self.epoch(
  File "/home/urkax/project/GenFed/manifold-flow-public/experiments/training/trainer.py", line 380, in epoch
    batch_loss, batch_loss_contributions = self.batch_train(
  File "/home/urkax/project/GenFed/manifold-flow-public/experiments/training/trainer.py", line 513, in batch_train
    loss_contributions = self.forward_pass(batch_data, loss_functions, forward_kwargs=forward_kwargs, custom_kwargs=custom_kwargs)
  File "/home/urkax/project/GenFed/manifold-flow-public/experiments/training/trainer.py", line 633, in forward_pass
    self._check_for_nans("Reconstructed data", x_reco)
  File "/home/urkax/project/GenFed/manifold-flow-public/experiments/training/trainer.py", line 122, in _check_for_nans
    raise NanException
training.trainer.NanException

I am using 5 GPUs with PyTorch 1.7.1.
Have you ever encountered this problem?

I find that this occurs only when I use multiple GPUs for training, but I do not know why.

I trained the MSE on all parameters instead of only part of them (when training celeba_emf), and that somehow works. Decreasing the learning rate also helps.
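
For anyone who wants to automate the "decrease the learning rate" workaround, here is a minimal, framework-agnostic sketch. It is not code from this repository: `backoff_learning_rate`, `factor`, and `min_lr` are hypothetical names chosen for illustration. The idea is to skip the update and shrink the learning rate whenever the batch loss is NaN or infinite, instead of aborting the run. It does not address why the NaNs appear only with multiple GPUs.

```python
import math


def backoff_learning_rate(lr, loss, factor=0.5, min_lr=1e-8):
    """Return (new_lr, skip_update) for one training step.

    Hypothetical helper: if the loss is NaN or infinite, signal that the
    optimizer step should be skipped and shrink the learning rate by
    `factor`, never going below `min_lr`. Otherwise leave it unchanged.
    """
    if math.isfinite(loss):
        return lr, False
    return max(lr * factor, min_lr), True


# Toy run with made-up loss values standing in for real batch losses.
lr = 1e-3
history = []
for loss in [0.9, 0.7, float("nan"), 0.6, float("inf"), 0.5]:
    lr, skip = backoff_learning_rate(lr, loss)
    history.append((lr, skip))
    # In a PyTorch training loop one would write the new value back with
    #   for group in optimizer.param_groups: group["lr"] = lr
    # and call optimizer.step() only when `skip` is False.
```

Halving is an arbitrary choice here; a gentler factor or a fixed lower learning rate from the start may work just as well.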