google/nerfactor

gradient error in Joint Optimization

hongsiyu opened this issue · 6 comments

Shape pre-training succeeds, but I am stuck in joint optimization, which fails with the following error:

2022-09-27 02:30:25.358618: E tensorflow/core/kernels/check_numerics_op.cc:289] abnormal_detected_host @0x7f43f6808a00 = {1, 0} Not a number (NaN) or infinity (Inf) values detected in gradient. b'Albedo'
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Not a number (NaN) or infinity (Inf) values detected in gradient. b'Albedo' : Tensor had NaN values
[[node gradient_tape/model/CheckNumerics (defined at tmp/tmp398ckawp.py:22) ]]
[[Identity_6/_372]]
(1) Invalid argument: Not a number (NaN) or infinity (Inf) values detected in gradient. b'Albedo' : Tensor had NaN values
[[node gradient_tape/model/CheckNumerics (defined at tmp/tmp398ckawp.py:22) ]]
0 successful operations.
0 derived errors ignored. [Op:__inference_distributed_train_step_45946]

I use my own data, whose cameras are computed by COLMAP.

Turn down the learning rate. Like you, I trained on my own data (created with Blender). With the default learning rate (5e-3) I got the same error as you; after I lowered it to 5e-4, everything was OK.
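
To illustrate why a smaller learning rate can avoid this kind of error, here is a toy sketch in plain Python. It is not NeRFactor code: the objective f(x) = x^4, the starting point, and the helper name `first_nonfinite_step` are all assumptions chosen so that a step size of 5e-3 overshoots until the gradient overflows, while 5e-4 stays finite.

```python
import math

def first_nonfinite_step(lr, x0=20.0, steps=200):
    """Run gradient descent on the toy objective f(x) = x**4.

    Returns the index of the first step whose gradient is NaN or Inf,
    or None if every gradient stays finite. Hypothetical helper, not
    part of NeRFactor or TensorFlow.
    """
    x = x0
    for step in range(steps):
        # Gradient of x**4; plain multiplication overflows to inf
        # instead of raising OverflowError as x ** 3 would.
        grad = 4.0 * x * x * x
        if not math.isfinite(grad):
            return step
        x = x - lr * grad
    return None

# With the larger step size the iterate overshoots further on every
# update, so the gradient eventually becomes non-finite.
print("lr=5e-3 diverged:", first_nonfinite_step(5e-3) is not None)
# With the smaller step size the iterate shrinks toward 0.
print("lr=5e-4 diverged:", first_nonfinite_step(5e-4) is not None)
```

In a real model the threshold depends on the loss and the data, so a learning rate that works for one dataset may still be too large for another.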

I have set lr to 5e-4 and to 5e-5, and I still get the same error.

Did you override lr in the config_override of Joint Optimization (training and validation)? E.g., --config_override="lr=$lr".

Yep, I changed lr directly in the config file shape_mvs.ini

and nerfactor_mvs.ini
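
When lr is edited directly in several config files, a quick check that every file really carries the intended value can rule out a stale setting. The sketch below uses only Python's standard `configparser`. The `[DEFAULT]` section, the sample file contents, and the helper names `read_lr` and `mismatched_configs` are assumptions for illustration; adjust them to match the actual files.

```python
import configparser

def read_lr(ini_text, section="DEFAULT", key="lr"):
    """Return the learning rate stored in an INI document as a float."""
    # interpolation=None leaves any '%' characters in other values alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    return float(parser[section][key])

def mismatched_configs(ini_texts, expected_lr):
    """Return the names of configs whose lr differs from expected_lr."""
    return [name for name, text in ini_texts.items()
            if read_lr(text) != expected_lr]

# Hypothetical file contents; in practice, read the text of each .ini file.
SAMPLE_CONFIGS = {
    "shape_mvs.ini": "[DEFAULT]\nlr = 5e-4\n",
    "nerfactor_mvs.ini": "[DEFAULT]\nlr = 5e-3\n",
}

# Lists every config that still carries a different learning rate.
print(mismatched_configs(SAMPLE_CONFIGS, 5e-4))
```

Passing the value with --config_override="lr=$lr", as suggested earlier in the thread, avoids editing the files at all.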