Training VPoser, but torch.distributions.normal.Normal gets NaN
lithiumice opened this issue · 2 comments
I tried to retrain VPoser on the AMASS dataset, which I downloaded from the official website. I followed the instructions in the README but still hit this weird error: after training for about 200 epochs, the call to torch.distributions.normal.Normal at line 56 of src/human_body_prior/models/vposer_model.py
started receiving NaN values. It looks like it might be caused by a data issue.
I would appreciate it if anyone could figure out why and how this happens, or offer any insight.
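For anyone debugging this, here is a minimal sketch (not from the repository; standard PyTorch tooling) for localizing where the NaNs first appear: enable autograd anomaly detection, and assert that the tensors feeding the Normal distribution are finite so the failure surfaces at the first bad op rather than at distribution construction.

import torch

# Anomaly detection makes the backward pass raise at the first op that
# produces NaN/Inf, pointing back at the offending forward-pass line.
# It slows training considerably, so enable it only while debugging.
torch.autograd.set_detect_anomaly(True)

def assert_finite(name, t):
    # Fail fast with a descriptive message instead of letting NaNs
    # propagate until torch.distributions.Normal rejects them.
    if not torch.isfinite(t).all():
        bad = (~torch.isfinite(t)).sum().item()
        raise RuntimeError(f"{name} has {bad} non-finite entries")

# Hypothetical placement inside the forward that constructs the Normal
# (line 56 of vposer_model.py), to distinguish bad inputs from bad weights:
#   assert_finite("encoder features", Xout)
#   assert_finite("loc", self.mu(Xout))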
#training_jobs to be done: 1
GPU available: True, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:1580: UserWarning: GPU available but not used. Set the gpus flag in your trainer `Trainer(gpus=1)` or script `--gpus=1`.
rank_zero_warn(
[V08_16] -- Total Trainable Parameters Count in vp_model is 0.94 M.
| Name | Type | Params
---------------------------------------
0 | vp_model | VPoser | 936 K
1 | bm_train | BodyModel | 0
---------------------------------------
936 K Trainable params
0 Non-trainable params
936 K Total params
3.745 Total estimated model params size (MB)
Validation sanity check: 0%| | 0/2 [00:00<?, ?it/s]loss_kl:0.02 loss_mesh_rec:1.02 matrot:4.36 jtr:0.54 loss_total:5.95
Validation sanity check: 50%|█████████████████████████████████████████████████████████████ | 1/2 [00:02<00:02, 2.05s/it]loss_kl:0.02 loss_mesh_rec:1.00 matrot:4.34 jtr:0.53 loss_total:5.89
Validation sanity check: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:02<00:00, 1.07it/s][V08_16] -- Epoch 0: val_loss:0.51
[V08_16] -- lr is [0.001]
Training: 0it [00:00, ?it/s][V08_16] -- Created a git archive backup at /data/hualin/vposer_train_gen/V08_16/code/vposer_2023_08_17_13_44_54.tar.gz
Epoch 0: 0%| | 0/7637 [00:00<?, ?it/s]loss_kl:0.02 loss_mesh_rec:1.00 matrot:4.30 jtr:0.53 loss_total:5.86
/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/closure.py:35: LightningDeprecationWarning: One of the returned values {'log', 'progress_bar'} has a `grad_fn`. We will detach it automatically but this behaviour will change in v1.6. Please detach it manually: `return {'loss': ..., 'something': something.detach()}`
rank_zero_deprecation(
Epoch 0: 0%|
...
Epoch 0: 1%|█▌ | 107/7637 [00:25<30:11, 4.16it/s, loss=0.665, v_num=30]loss_kl:0.08 loss_mesh_rec:0.13 matrot:0.36 jtr:0.11 loss_total:0.69
Epoch 0: 1%|█▌ | 108/7637 [00:25<30:09, 4.16it/s, loss=0.663, v_num=30]loss_kl:0.08 loss_mesh_rec:0.13 matrot:0.37 jtr:0.11 loss_total:0.69
Epoch 0: 1%|█▌ | 109/7637 [00:26<30:08, 4.16it/s, loss=0.666, v_num=30]Traceback (most recent call last):
File "V02_05.py", line 55, in <module>
main()
File "V02_05.py", line 51, in main
train_vposer_once(job)
File "/home/hualin//vposer_66/src/human_body_prior/train/vposer_trainer.py", line 361, in train_vposer_once
trainer.fit(model)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 737, in fit
self._call_and_handle_interrupt(
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 682, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 772, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1195, in _run
self._dispatch()
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1275, in _dispatch
self.training_type_plugin.start_training(self)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 202, in start_training
self._results = trainer.run_stage()
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1285, in run_stage
return self._run_train()
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1315, in _run_train
self.fit_loop.run()
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
self.advance(*args, **kwargs)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 234, in advance
self.epoch_loop.run(data_fetcher)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
self.advance(*args, **kwargs)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 193, in advance
batch_output = self.batch_loop.run(batch, batch_idx)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
self.advance(*args, **kwargs)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 88, in advance
outputs = self.optimizer_loop.run(split_batch, optimizers, batch_idx)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 145, in run
self.advance(*args, **kwargs)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 215, in advance
result = self._run_optimization(
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 266, in _run_optimization
self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 378, in _optimizer_step
lightning_module.optimizer_step(
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1652, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 164, in step
trainer.accelerator.optimizer_step(self._optimizer, self._optimizer_idx, closure, **kwargs)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 336, in optimizer_step
self.precision_plugin.optimizer_step(model, optimizer, opt_idx, closure, **kwargs)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 163, in optimizer_step
optimizer.step(closure=closure, **kwargs)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/torch/optim/optimizer.py", line 140, in wrapper
out = func(*args, **kwargs)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/torch/optim/optimizer.py", line 23, in _use_grad
ret = func(self, *args, **kwargs)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/torch/optim/adam.py", line 183, in step
loss = closure()
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/precision_plugin.py", line 148, in _wrap_closure
closure_result = closure()
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 160, in __call__
self._result = self.closure(*args, **kwargs)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 142, in closure
step_output = self._step_fn()
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/loops/optimization/optimizer_loop.py", line 435, in _training_step
training_step_output = self.trainer.accelerator.training_step(step_kwargs)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 216, in training_step
return self.training_type_plugin.training_step(*step_kwargs.values())
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 213, in training_step
return self.model.training_step(*args, **kwargs)
File "/home/hualin//vposer_66/src/human_body_prior/train/vposer_trainer.py", line 232, in training_step
drec = self(batch['pose_body'].view(-1, 63))
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/hualin//vposer_66/src/human_body_prior/train/vposer_trainer.py", line 107, in forward
return self.vp_model(pose_body)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/hualin//vposer_66/src/human_body_prior/models/vposer_model.py", line 121, in forward
q_z = self.encode(pose_body)
File "/home/hualin//vposer_66/src/human_body_prior/models/vposer_model.py", line 100, in encode
return self.encoder_net(pose_body)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/torch/nn/modules/container.py", line 204, in forward
input = module(input)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/hualin///vposer_66/src/human_body_prior/models/vposer_model.py", line 56, in forward
return torch.distributions.normal.Normal(self.mu(Xout), F.softplus(self.logvar(Xout)))
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/torch/distributions/normal.py", line 56, in __init__
super(Normal, self).__init__(batch_shape, validate_args=validate_args)
File "/home/hualin/miniconda3/envs/PriorMDM/lib/python3.8/site-packages/torch/distributions/distribution.py", line 56, in __init__
raise ValueError(
ValueError: Expected parameter loc (Tensor of shape (128, 32)) of distribution Normal(loc: torch.Size([128, 32]), scale: torch.Size([128, 32])) to satisfy the constraint Real(), but found invalid values:
tensor([[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
...,
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan],
[nan, nan, nan, ..., nan, nan, nan]], grad_fn=<AddmmBackward0>)
Hi, have you solved this problem? I got the same error after the first several iterations of training.
ValueError: Expected parameter loc (Tensor of shape (128, 32)) of distribution Normal(loc: torch.Size([128, 32]), scale: torch.Size([128, 32])) to satisfy the constraint Real(), but found invalid values:
tensor([[nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        ...,
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan],
        [nan, nan, nan, ..., nan, nan, nan]], device='cuda:0', grad_fn=<AddmmBackward0>)
Epoch 0: 0%| | 106/145399 [00:05<2:08:50, 18.79it/s, v_num=6, train_loss=0.777]
I've identified and fixed a bug in the geodesic_loss_R class, part of the loss function used in VPoser. The issue lies in the cosine computation of the geodesic loss for rotation matrices: floating-point error can push the trace-derived cosine slightly outside [-1, 1], the domain of acos, and the gradient of acos diverges at ±1, so a single boundary value is enough to inject NaNs that then spread through the network weights. Clamping the cosine into a slightly shrunken interval fixes it. The modified code in src/human_body_prior/tools/angle_continuous_repres.py
is shown below:
import torch
import torch.nn as nn

class geodesic_loss_R(nn.Module):
    def __init__(self, reduction='batchmean'):
        super(geodesic_loss_R, self).__init__()
        self.reduction = reduction
        self.eps = 1e-6

    # batch geodesic loss for rotation matrices
    def bgdR(self, m1, m2):
        m = torch.bmm(m1, m2.transpose(1, 2))  # batch*3*3
        cos = (m[:, 0, 0] + m[:, 1, 1] + m[:, 2, 2] - 1) / 2
        # The fix: clamp cos into (-1 + eps, 1 - eps) so acos never sees
        # an out-of-range value and its gradient stays finite.
        cos = torch.clamp(cos, -1 + self.eps, 1 - self.eps)
        return torch.acos(cos)

    def forward(self, ypred, ytrue):
        theta = self.bgdR(ypred, ytrue)
        if self.reduction == 'mean':
            return torch.mean(theta)
        if self.reduction == 'batchmean':
            return torch.mean(torch.sum(theta, dim=theta.shape[1:]))
        else:
            return theta
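To see why the clamp matters (a standalone check, not part of the fix above): the derivative of acos(x) is -1/sqrt(1 - x^2), which diverges as x approaches ±1. When the predicted and target rotations are numerically identical, the trace-derived cosine is exactly 1 and the unclamped gradient is already non-finite:

import torch

# Unclamped: acos(1.0) is 0, but its gradient is -1/sqrt(1 - 1) = -inf,
# which turns into NaN as soon as it mixes with other gradients.
cos = torch.tensor(1.0, requires_grad=True)
torch.acos(cos).backward()
print(cos.grad)  # tensor(-inf)

# Clamped into (-1 + eps, 1 - eps): inputs outside the interval receive
# zero gradient from clamp, so nothing non-finite enters backprop.
eps = 1e-6
cos2 = torch.tensor(1.0, requires_grad=True)
torch.acos(torch.clamp(cos2, -1 + eps, 1 - eps)).backward()
print(cos2.grad)  # tensor(0.)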