philgras/neural-head-avatars

optimize_avatar error

Closed this issue · 4 comments

Dear philgras, can you help me?

Z:\python\philgras_neural_head_avatars\neural_head_avatars2>Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\python python_scripts/optimize_nha.py --config configs/optimize_avatar.ini
Start Model training with the following configuration:
Command Line Args: --config configs/optimize_avatar.ini
Config File (configs/optimize_avatar.ini):
image_log_period: 20
num_sanity_val_steps:0
gpus: 1
distributed_backend:ddp
accelerator: ddp
default_root_dir: demo/optimized_avatars
data_path: demo/input_video
split_config: configs/split.json
tracking_results_path:demo/input_video2/tracking_0/tracked_flame_params.npz
data_worker: 8
load_lmk: true
load_seg: true
load_camera: true
load_flame: true
load_normal: true
load_parsing: true
train_batch_size: [4, 2, 2]
validation_batch_size:[2, 2, 2]
epochs_offset: 150
epochs_texture: 50
epochs_joint: 50
flame_lr: [0.001, 0.01, 0.0002]
offset_lr: [1e-05, 1e-05, 2e-06]
tex_lr: [0.0001, 5e-05, 2e-05]
spatial_blur_sigma:0.01
offset_hidden_layers:6
offset_hidden_feats:128
texture_hidden_feats:256
texture_hidden_layers:8
d_normal_encoding: 32
d_normal_encoding_hidden:128
n_normal_encoding_hidden:2
subdivide_mesh: 1
flame_noise: .1
soft_clip_sigma: 0.1
body_part_weights: configs/body_part_weights.json
w_rgb: [0, 1, 0.05]
w_perc: [0, 10, 0.5]
w_norm: [0.02, 0.02, 0.02]
w_edge: [10.0, 10.0, 10.0]
w_eye_closed: [100000.0, 100000.0, 100000.0]
w_semantic_ear: [0.1, 0.1, 0.1]
w_semantic_eye: [0.1, 0.1, 0.1]
w_semantic_hair: [[0.1, 50], [0.01, 100]]
w_silh: [[0.01, 50], [0.1, 100]]
w_lap: [[0.05, 50], [0.05, 100]]
w_surface_reg: [0.0001, 0.0001, 0.0001]
w_lmk: [0.01, 0.1, 0]
w_shape_reg: [0.001, 0.001, 0.001]
w_expr_reg: [0.001, 0.001, 0.001]
w_pose_reg: [0.001, 0.001, 0.001]
texture_weight_decay:[0.0001, 0.0001, 5e-06]
Defaults:
--texture_d_hidden_dynamic:128
--texture_n_hidden_dynamic:1
--glob_rot_noise: 5.0
--semantics_blur: 3
--w_semantic_mouth:[0.1, 0.1, 0.1]
--logger: True
--checkpoint_callback:True
--gradient_clip_val:0
--process_position:0
--num_nodes: 1
--num_processes: 1
--auto_select_gpus:False
--tpu_cores: <function _gpus_arg_default at 0x00000254E4989280>
--overfit_batches: 0.0
--track_grad_norm: -1
--check_val_every_n_epoch:1
--fast_dev_run: False
--accumulate_grad_batches:1
--limit_train_batches:1.0
--limit_val_batches:1.0
--limit_test_batches:1.0
--limit_predict_batches:1.0
--val_check_interval:1.0
--flush_logs_every_n_steps:100
--log_every_n_steps:50
--sync_batchnorm: False
--precision: 32
--weights_summary: top
--benchmark: False
--deterministic: False
--reload_dataloaders_every_epoch:False
--auto_lr_find: False
--replace_sampler_ddp:True
--terminate_on_nan:False
--auto_scale_batch_size:False
--prepare_data_per_node:True
--amp_backend: native
--amp_level: O2
--move_metrics_to_cpu:False
--multiple_trainloader_mode:max_size_cycle
--stochastic_weight_avg:False
--checkpoint_file:

[05/20 04:39:43 nha.data.real]: Collected real training dataset containing: 201 samples.
[05/20 04:39:43 nha.data.real]: Collected real validation dataset containing: 100 samples.
Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\utilities\distributed.py:51: UserWarning: ModelCheckpoint(save_last=True, monitor=None) is a redundant configuration. You can save the last checkpoint with ModelCheckpoint(save_top_k=None, monitor=None).
warnings.warn(*args, **kwargs)
Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch3d-0.6.1-py3.9-win-amd64.egg\pytorch3d\structures\meshes.py:1108: UserWarning: floordiv is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
self._edges_packed = torch.stack([u // V, u % V], dim=1)
[05/20 04:40:10 nha.optimization.train_pl_module]: Running the offset-optimization stage.
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1

  | Name            | Type               | Params
-------------------------------------------------------
0 | _flame          | FlameHead          | 0
1 | _offset_mlp     | OffsetMLP          | 616 K
2 | _normal_encoder | SirenNormalEncoder | 542 K
3 | _texture        | TextureMLP         | 1.8 M
4 | _explFeatures   | MultiTexture       | 4.5 M
5 | _leaky_hinge    | LeakyHingeLoss     | 0
6 | _masked_L1      | MaskedCriterion    | 0
-------------------------------------------------------
7.9 M     Trainable params
0         Non-trainable params
7.9 M     Total params
31.623    Total estimated model params size (MB)
Epoch 0: 0%| | 0/101 [00:44<?, ?it/s]
Traceback (most recent call last):
  File "Z:\python\philgras_neural_head_avatars\neural_head_avatars2\python_scripts\optimize_nha.py", line 12, in <module>
    train_pl_module(NHAOptimizer, RealDataModule)
  File "Z:\python/philgras_neural_head_avatars/neural_head_avatars2\nha\optimization\train_pl_module.py", line 89, in train_pl_module
    trainer.fit(model,
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 498, in fit
    self.dispatch()
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 545, in dispatch
    self.accelerator.start_training(self)
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 73, in start_training
    self.training_type_plugin.start_training(trainer)
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\plugins\training_type\training_type_plugin.py", line 114, in start_training
    self._results = trainer.run_train()
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 636, in run_train
    self.train_loop.run_training_epoch()
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 493, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 658, in run_training_batch
    self._curr_step_result = self.training_step(
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 293, in training_step
    training_step_output = self.trainer.accelerator.training_step(args)
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 156, in training_step
    return self.training_type_plugin.training_step(*args)
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\plugins\training_type\ddp.py", line 294, in training_step
    return self.model(*args, **kwargs)
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\torch\nn\parallel\distributed.py", line 878, in forward
    self._sync_params()
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\torch\nn\parallel\distributed.py", line 1379, in _sync_params
    self._distributed_broadcast_coalesced(
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\torch\nn\parallel\distributed.py", line 1334, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: Invalid scalar type

This looks more like a pytorch/pytorch-lightning related error. Note how the error is raised in "torch\nn\parallel\distributed.py". Have you made sure that your pytorch installation is working properly? Also, which version of pytorch and pytorch-lightning are you using?
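For reference, a standalone snippet (not part of the repository) that reports the versions being asked about and which distributed backends the torch build supports, since the failing call is a torch.distributed broadcast:

```python
# Quick environment check; assumes it is run in the same Python
# environment used to launch python_scripts/optimize_nha.py.
import torch
import torch.distributed as dist
import pytorch_lightning as pl

print("torch:", torch.__version__)
print("pytorch-lightning:", pl.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

# The crash happens inside a distributed broadcast, so the supported
# backends are also relevant (NCCL is not available on Windows).
print("torch.distributed available:", dist.is_available())
if dist.is_available():
    print("NCCL available:", dist.is_nccl_available())
    print("Gloo available:", dist.is_gloo_available())
```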

Did you manage to solve the issue? Feel free to reopen if it persists.

I'm having the same error on Windows. I'm trying to find a solution.

  | Name            | Type               | Params
-------------------------------------------------------
0 | _flame          | FlameHead          | 0
1 | _offset_mlp     | OffsetMLP          | 616 K
2 | _normal_encoder | SirenNormalEncoder | 542 K
3 | _texture        | TextureMLP         | 1.8 M
4 | _explFeatures   | MultiTexture       | 4.5 M
5 | _leaky_hinge    | LeakyHingeLoss     | 0
6 | _masked_L1      | MaskedCriterion    | 0
-------------------------------------------------------
7.9 M     Trainable params
0         Non-trainable params
7.9 M     Total params
31.494    Total estimated model params size (MB)
Epoch 0:   0%|                                                                                   | 0/5 [00:38<?, ?it/s]
Traceback (most recent call last):
  File "D:\MOCAP\neural-head-avatars\python_scripts\optimize_nha.py", line 11, in <module>
    train_pl_module(NHAOptimizer, RealDataModule)
  File "d:\mocap\neural-head-avatars\nha\optimization\train_pl_module.py", line 88, in train_pl_module
    trainer.fit(model,
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 498, in fit
    self.dispatch()
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 545, in dispatch
    self.accelerator.start_training(self)
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 73, in start_training
    self.training_type_plugin.start_training(trainer)
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\pytorch_lightning\plugins\training_type\training_type_plugin.py", line 114, in start_training
    self._results = trainer.run_train()
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 636, in run_train
    self.train_loop.run_training_epoch()
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 493, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 658, in run_training_batch
    self._curr_step_result = self.training_step(
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 293, in training_step
    training_step_output = self.trainer.accelerator.training_step(args)
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 156, in training_step
    return self.training_type_plugin.training_step(*args)
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\pytorch_lightning\plugins\training_type\ddp.py", line 294, in training_step
    return self.model(*args, **kwargs)
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\torch\nn\parallel\distributed.py", line 878, in forward
    self._sync_params()
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\torch\nn\parallel\distributed.py", line 1379, in _sync_params
    self._distributed_broadcast_coalesced(
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\torch\nn\parallel\distributed.py", line 1334, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: Invalid scalar type

(neural) D:\MOCAP\neural-head-avatars>

@leeooo001 I ended up making it work by changing a couple of things.
In optimize_nha.py I added two lines to force the script not to use NCCL:
(screenshot of the two lines added to optimize_nha.py)
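The exact lines from the screenshot aren't reproduced in this thread; a sketch of what such a change can look like, assuming pytorch-lightning's PL_TORCH_DISTRIBUTED_BACKEND environment variable is used to select gloo instead of NCCL:

```python
# Sketch only -- placed near the top of python_scripts/optimize_nha.py,
# before the Trainer is created. NCCL has no Windows support, so the
# gloo backend is selected instead.
import os
os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "gloo"
```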

I also changed the settings in optimize_avatar.ini: basically, I removed ddp from distributed_backend and accelerator.
(screenshot of the edited optimize_avatar.ini)
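Again only a sketch of the edit, assuming the two entries shown in the config dump above are simply dropped (or commented out) so that Lightning runs a single process on one GPU instead of spawning a DDP job:

```ini
# configs/optimize_avatar.ini -- relevant part of the edit only
gpus: 1
# distributed_backend: ddp   (removed)
# accelerator: ddp           (removed)
```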

At least for me, it seemed that the issue on Windows was due to the distributed process. It took me two days to solve this, and it was all in the configuration :D