philgras/neural-head-avatars

optimize_avatar error

Closed this issue · 4 comments

Dear philgras, can you help me?

Z:\python\philgras_neural_head_avatars\neural_head_avatars2>Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\python python_scripts/optimize_nha.py --config configs/optimize_avatar.ini
Start Model training with the following configuration:
Command Line Args: --config configs/optimize_avatar.ini
Config File (configs/optimize_avatar.ini):
image_log_period: 20
num_sanity_val_steps:0
gpus: 1
distributed_backend:ddp
accelerator: ddp
default_root_dir: demo/optimized_avatars
data_path: demo/input_video
split_config: configs/split.json
tracking_results_path:demo/input_video2/tracking_0/tracked_flame_params.npz
data_worker: 8
load_lmk: true
load_seg: true
load_camera: true
load_flame: true
load_normal: true
load_parsing: true
train_batch_size: [4, 2, 2]
validation_batch_size:[2, 2, 2]
epochs_offset: 150
epochs_texture: 50
epochs_joint: 50
flame_lr: [0.001, 0.01, 0.0002]
offset_lr: [1e-05, 1e-05, 2e-06]
tex_lr: [0.0001, 5e-05, 2e-05]
spatial_blur_sigma:0.01
offset_hidden_layers:6
offset_hidden_feats:128
texture_hidden_feats:256
texture_hidden_layers:8
d_normal_encoding: 32
d_normal_encoding_hidden:128
n_normal_encoding_hidden:2
subdivide_mesh: 1
flame_noise: .1
soft_clip_sigma: 0.1
body_part_weights: configs/body_part_weights.json
w_rgb: [0, 1, 0.05]
w_perc: [0, 10, 0.5]
w_norm: [0.02, 0.02, 0.02]
w_edge: [10.0, 10.0, 10.0]
w_eye_closed: [100000.0, 100000.0, 100000.0]
w_semantic_ear: [0.1, 0.1, 0.1]
w_semantic_eye: [0.1, 0.1, 0.1]
w_semantic_hair: [[0.1, 50], [0.01, 100]]
w_silh: [[0.01, 50], [0.1, 100]]
w_lap: [[0.05, 50], [0.05, 100]]
w_surface_reg: [0.0001, 0.0001, 0.0001]
w_lmk: [0.01, 0.1, 0]
w_shape_reg: [0.001, 0.001, 0.001]
w_expr_reg: [0.001, 0.001, 0.001]
w_pose_reg: [0.001, 0.001, 0.001]
texture_weight_decay:[0.0001, 0.0001, 5e-06]
Defaults:
--texture_d_hidden_dynamic:128
--texture_n_hidden_dynamic:1
--glob_rot_noise: 5.0
--semantics_blur: 3
--w_semantic_mouth:[0.1, 0.1, 0.1]
--logger: True
--checkpoint_callback:True
--gradient_clip_val:0
--process_position:0
--num_nodes: 1
--num_processes: 1
--auto_select_gpus:False
--tpu_cores: <function _gpus_arg_default at 0x00000254E4989280>
--overfit_batches: 0.0
--track_grad_norm: -1
--check_val_every_n_epoch:1
--fast_dev_run: False
--accumulate_grad_batches:1
--limit_train_batches:1.0
--limit_val_batches:1.0
--limit_test_batches:1.0
--limit_predict_batches:1.0
--val_check_interval:1.0
--flush_logs_every_n_steps:100
--log_every_n_steps:50
--sync_batchnorm: False
--precision: 32
--weights_summary: top
--benchmark: False
--deterministic: False
--reload_dataloaders_every_epoch:False
--auto_lr_find: False
--replace_sampler_ddp:True
--terminate_on_nan:False
--auto_scale_batch_size:False
--prepare_data_per_node:True
--amp_backend: native
--amp_level: O2
--move_metrics_to_cpu:False
--multiple_trainloader_mode:max_size_cycle
--stochastic_weight_avg:False
--checkpoint_file:

[05/20 04:39:43 nha.data.real]: Collected real training dataset containing: 201 samples.
[05/20 04:39:43 nha.data.real]: Collected real validation dataset containing: 100 samples.
Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\utilities\distributed.py:51: UserWarning: ModelCheckpoint(save_last=True, monitor=None) is a redundant configuration. You can save the last checkpoint with ModelCheckpoint(save_top_k=None, monitor=None).
warnings.warn(*args, **kwargs)
Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch3d-0.6.1-py3.9-win-amd64.egg\pytorch3d\structures\meshes.py:1108: UserWarning: floordiv is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
self._edges_packed = torch.stack([u // V, u % V], dim=1)
[05/20 04:40:10 nha.optimization.train_pl_module]: Running the offset-optimization stage.
GPU available: True, used: True
TPU available: None, using: 0 TPU cores
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1

  | Name            | Type               | Params
-------------------------------------------------------
0 | _flame          | FlameHead          | 0
1 | _offset_mlp     | OffsetMLP          | 616 K
2 | _normal_encoder | SirenNormalEncoder | 542 K
3 | _texture        | TextureMLP         | 1.8 M
4 | _explFeatures   | MultiTexture       | 4.5 M
5 | _leaky_hinge    | LeakyHingeLoss     | 0
6 | _masked_L1      | MaskedCriterion    | 0
-------------------------------------------------------
7.9 M     Trainable params
0         Non-trainable params
7.9 M     Total params
31.623    Total estimated model params size (MB)
Epoch 0: 0%| | 0/101 [00:44<?, ?it/s]
Traceback (most recent call last):
  File "Z:\python\philgras_neural_head_avatars\neural_head_avatars2\python_scripts\optimize_nha.py", line 12, in <module>
    train_pl_module(NHAOptimizer, RealDataModule)
  File "Z:\python/philgras_neural_head_avatars/neural_head_avatars2\nha\optimization\train_pl_module.py", line 89, in train_pl_module
    trainer.fit(model,
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 498, in fit
    self.dispatch()
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 545, in dispatch
    self.accelerator.start_training(self)
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 73, in start_training
    self.training_type_plugin.start_training(trainer)
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\plugins\training_type\training_type_plugin.py", line 114, in start_training
    self._results = trainer.run_train()
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 636, in run_train
    self.train_loop.run_training_epoch()
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 493, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 658, in run_training_batch
    self._curr_step_result = self.training_step(
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 293, in training_step
    training_step_output = self.trainer.accelerator.training_step(args)
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 156, in training_step
    return self.training_type_plugin.training_step(*args)
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\pytorch_lightning\plugins\training_type\ddp.py", line 294, in training_step
    return self.model(*args, **kwargs)
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\torch\nn\parallel\distributed.py", line 878, in forward
    self._sync_params()
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\torch\nn\parallel\distributed.py", line 1379, in _sync_params
    self._distributed_broadcast_coalesced(
  File "Z:\python\philgras_neural_head_avatars\WPy64-39100\python-3.9.10.amd64\lib\site-packages\torch\nn\parallel\distributed.py", line 1334, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: Invalid scalar type

This looks more like a pytorch/pytorch-lightning related error. Note how the error is raised in "torch\nn\parallel\distributed.py". Have you made sure that your pytorch installation is working properly? Also, which version of pytorch and pytorch-lightning are you using?
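For reference, a standalone snippet (not part of the repository) that reports the versions being asked about and which distributed backends the torch build supports, since the failing call is a torch.distributed broadcast:

```python
# Quick environment check; assumes it is run in the same Python
# environment used to launch python_scripts/optimize_nha.py.
import torch
import torch.distributed as dist
import pytorch_lightning as pl

print("torch:", torch.__version__)
print("pytorch-lightning:", pl.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))

# The crash happens inside a distributed broadcast, so the supported
# backends are also relevant (NCCL is not available on Windows).
print("torch.distributed available:", dist.is_available())
if dist.is_available():
    print("NCCL available:", dist.is_nccl_available())
    print("Gloo available:", dist.is_gloo_available())
```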

Did you manage to solve the issue? Feel free to reopen if it persists.

I'm having the same error on Windows. I'm trying to find a solution.

  | Name            | Type               | Params
-------------------------------------------------------
0 | _flame          | FlameHead          | 0
1 | _offset_mlp     | OffsetMLP          | 616 K
2 | _normal_encoder | SirenNormalEncoder | 542 K
3 | _texture        | TextureMLP         | 1.8 M
4 | _explFeatures   | MultiTexture       | 4.5 M
5 | _leaky_hinge    | LeakyHingeLoss     | 0
6 | _masked_L1      | MaskedCriterion    | 0
-------------------------------------------------------
7.9 M     Trainable params
0         Non-trainable params
7.9 M     Total params
31.494    Total estimated model params size (MB)
Epoch 0:   0%|                                                                                   | 0/5 [00:38<?, ?it/s]
Traceback (most recent call last):
  File "D:\MOCAP\neural-head-avatars\python_scripts\optimize_nha.py", line 11, in <module>
    train_pl_module(NHAOptimizer, RealDataModule)
  File "d:\mocap\neural-head-avatars\nha\optimization\train_pl_module.py", line 88, in train_pl_module
    trainer.fit(model,
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 498, in fit
    self.dispatch()
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 545, in dispatch
    self.accelerator.start_training(self)
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 73, in start_training
    self.training_type_plugin.start_training(trainer)
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\pytorch_lightning\plugins\training_type\training_type_plugin.py", line 114, in start_training
    self._results = trainer.run_train()
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\pytorch_lightning\trainer\trainer.py", line 636, in run_train
    self.train_loop.run_training_epoch()
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 493, in run_training_epoch
    batch_output = self.run_training_batch(batch, batch_idx, dataloader_idx)
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 658, in run_training_batch
    self._curr_step_result = self.training_step(
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\pytorch_lightning\trainer\training_loop.py", line 293, in training_step
    training_step_output = self.trainer.accelerator.training_step(args)
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\pytorch_lightning\accelerators\accelerator.py", line 156, in training_step
    return self.training_type_plugin.training_step(*args)
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\pytorch_lightning\plugins\training_type\ddp.py", line 294, in training_step
    return self.model(*args, **kwargs)
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\torch\nn\modules\module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\torch\nn\parallel\distributed.py", line 878, in forward
    self._sync_params()
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\torch\nn\parallel\distributed.py", line 1379, in _sync_params
    self._distributed_broadcast_coalesced(
  File "C:\Users\Pichau\.conda\envs\neural\lib\site-packages\torch\nn\parallel\distributed.py", line 1334, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: Invalid scalar type

(neural) D:\MOCAP\neural-head-avatars>

@leeooo001 I ended up making it work by changing a couple of things.
In optimize_nha.py I added two lines to force the script not to use NCCL:
(screenshot of the two lines added to optimize_nha.py)
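The exact lines from the screenshot aren't reproduced in this thread; a sketch of what such a change can look like, assuming pytorch-lightning's PL_TORCH_DISTRIBUTED_BACKEND environment variable is used to select gloo instead of NCCL:

```python
# Sketch only -- placed near the top of python_scripts/optimize_nha.py,
# before the Trainer is created. NCCL has no Windows support, so the
# gloo backend is selected instead.
import os
os.environ["PL_TORCH_DISTRIBUTED_BACKEND"] = "gloo"
```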

I also changed the settings in optimize_avatar.ini: basically, I removed ddp from distributed_backend and accelerator.
(screenshot of the edited optimize_avatar.ini)
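Again only a sketch of the edit, assuming the two entries shown in the config dump above are simply dropped (or commented out) so that Lightning runs a single process on one GPU instead of spawning a DDP job:

```ini
# configs/optimize_avatar.ini -- relevant part of the edit only
gpus: 1
# distributed_backend: ddp   (removed)
# accelerator: ddp           (removed)
```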

At least for me, it seemed that the issue on Windows was due to the distributed process. It took me two days to solve this, and it was all in the configuration :D