gradient error in Joint Optimization
hongsiyu opened this issue · 6 comments
Shape pre-training runs successfully, but I get stuck in joint optimization:
```
2022-09-27 02:30:25.358618: E tensorflow/core/kernels/check_numerics_op.cc:289] abnormal_detected_host @0x7f43f6808a00 = {1, 0} Not a number (NaN) or infinity (Inf) values detected in gradient. b'Albedo'
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Not a number (NaN) or infinity (Inf) values detected in gradient. b'Albedo' : Tensor had NaN values
[[node gradient_tape/model/CheckNumerics (defined at tmp/tmp398ckawp.py:22) ]]
[[Identity_6/_372]]
(1) Invalid argument: Not a number (NaN) or infinity (Inf) values detected in gradient. b'Albedo' : Tensor had NaN values
[[node gradient_tape/model/CheckNumerics (defined at tmp/tmp398ckawp.py:22) ]]
0 successful operations.
0 derived errors ignored. [Op:__inference_distributed_train_step_45946]
```
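For context, the crash is raised by a `tf.debugging.check_numerics` guard on the `'Albedo'` gradient (the `gradient_tape/model/CheckNumerics` node in the trace), so any NaN/Inf produced during backprop aborts training immediately. If you need to locate where the bad values first appear, `tf.debugging.enable_check_numerics()` adds such a check to every op, at a large speed cost. Below is a minimal sketch, not NeRFactor's actual train step (the model, loss, and data are stand-ins), of two common mitigations: clipping the global gradient norm, and skipping any update whose gradients are still non-finite.

```python
# Minimal sketch, NOT NeRFactor's actual train step: the model, loss, and
# data below are stand-ins. It shows two common guards against the
# "NaN values detected in gradient" crash: global-norm clipping, and
# skipping any update whose gradients are still non-finite.
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1)])   # stand-in model
optimizer = tf.keras.optimizers.Adam(learning_rate=5e-4)  # the lowered lr
loss_fn = tf.keras.losses.MeanSquaredError()

def train_step(x, y):
    with tf.GradientTape() as tape:
        loss = loss_fn(y, model(x, training=True))
    grads = tape.gradient(loss, model.trainable_variables)
    # Tame occasional spikes before they overflow to Inf/NaN.
    grads, _ = tf.clip_by_global_norm(grads, clip_norm=1.0)
    # If anything is still non-finite, skip the step instead of crashing.
    finite = all(bool(tf.reduce_all(tf.math.is_finite(g))) for g in grads)
    if finite:
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss, finite

x, y = tf.random.normal((4, 3)), tf.random.normal((4, 1))
loss, ok = train_step(x, y)  # eager call; ok is False when a step was skipped
```

Skipping a handful of bad steps usually lets training recover; if every step gets skipped, the learning rate or the input camera poses are the more likely culprits.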
I am using my own data, whose cameras were estimated by COLMAP.
Turn down the learning rate. Same as you, I trained on my own data created in Blender: with the default learning rate (5e-3) I got the same error, but after lowering it to 5e-4 everything was OK.
I have set lr to 5e-4 and 5e-5, and still hit the same error.
Did you override lr in the config_override of Joint Optimization (training and validation)? E.g., --config_override="lr=$lr". If the launch command passes such an override, it would take precedence over the value in the .ini file.
Yep, I changed lr directly in the config shape_mvs.ini
and nerfactor_mvs.ini
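For reference, here is a minimal sketch, using Python's standard configparser rather than NeRFactor's actual config loader, of the precedence the question above is getting at: overrides parsed from a "key=value" string are applied after the .ini file is read, so a --config_override passed on the launch command would shadow an lr edited directly in shape_mvs.ini or nerfactor_mvs.ini.

```python
# Minimal sketch with Python's standard configparser -- NOT NeRFactor's
# actual config loader. It illustrates the usual precedence: overrides
# parsed from a "key=value,key=value" string are applied after the .ini
# is read, so they shadow values edited directly in the file.
import configparser

def load_config(ini_path, override=""):
    config = configparser.ConfigParser()
    config.read(ini_path)  # silently skipped if the file does not exist
    for pair in filter(None, override.split(",")):
        key, value = pair.split("=", 1)
        config["DEFAULT"][key] = value  # applied last, so it wins
    return config

# Editing lr in the .ini has no effect if the launcher still passes an
# override such as --config_override="lr=5e-3" (hypothetical value).
config = load_config("shape_mvs.ini", override="lr=5e-4")
print(config["DEFAULT"].get("lr"))  # -> 5e-4, regardless of the file
```

If that is what is happening here, passing the lowered lr via --config_override (as suggested above), rather than editing the .ini files alone, would be the fix.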