NaN or Inf in 'Albedo' at step II. Joint Optimization
CorneliusHsiao opened this issue · 2 comments
CorneliusHsiao commented
Hi,
Great work. I am training your model on my own dataset, prepared in the real-data format. However, step II (Joint Optimization) always fails with the error below during training, validation, and testing. Could you give me some insight into which part of my configuration or data format might be wrong?
Error message
Exception has occurred: InvalidArgumentError (note: full exception trace is shown but execution is paused at: _run_module_as_main)
2 root error(s) found.
(0) Invalid argument: Not a number (NaN) or infinity (Inf) values detected in gradient. b'Albedo' : Tensor had NaN values
[[{{node cond/else/_1/StatefulPartitionedCall/gradient_tape/model/CheckNumerics_2}}]]
[[cond/else/_1/StatefulPartitionedCall/replica_1/model/assert_greater_3/Assert/AssertGuard/branch_executed/_57539/_6203]]
(1) Invalid argument: Not a number (NaN) or infinity (Inf) values detected in gradient. b'Albedo' : Tensor had NaN values
[[{{node cond/else/_1/StatefulPartitionedCall/gradient_tape/model/CheckNumerics_2}}]]
0 successful operations.
3 derived errors ignored. [Op:__inference_fn_with_cond_190304]
Function call stack:
fn_with_cond -> fn_with_cond
File "/home/admin/anaconda3/envs/nerfactor/lib/python3.6/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
inputs, attrs, num_outputs)
File "/home/admin/anaconda3/envs/nerfactor/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 598, in call
ctx=ctx)
File "/home/admin/anaconda3/envs/nerfactor/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1746, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "/home/admin/anaconda3/envs/nerfactor/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1665, in _filtered_call
self.captured_inputs)
File "/home/admin/anaconda3/envs/nerfactor/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 2420, in __call__
return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
File "/home/admin/anaconda3/envs/nerfactor/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 708, in _call
return function_lib.defun(fn_with_cond)(*canon_args, **canon_kwds)
File "/home/admin/anaconda3/envs/nerfactor/lib/python3.6/site-packages/tensorflow/python/eager/def_function.py", line 580, in __call__
result = self._call(*args, **kwds)
File "/home/admin/FaceReal/nerfactor/nerfactor/trainvali.py", line 181, in main
strategy, model, batch, optimizer, global_bs_train)
File "/home/admin/anaconda3/envs/nerfactor/lib/python3.6/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "/home/admin/anaconda3/envs/nerfactor/lib/python3.6/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/home/admin/FaceReal/nerfactor/nerfactor/trainvali.py", line 341, in <module>
app.run(main)
File "/home/admin/anaconda3/envs/nerfactor/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/admin/anaconda3/envs/nerfactor/lib/python3.6/runpy.py", line 96, in _run_module_code
mod_name, mod_spec, pkg_name, script_name)
File "/home/admin/anaconda3/envs/nerfactor/lib/python3.6/runpy.py", line 263, in run_path
pkg_name=pkg_name, script_name=fname)
File "/home/admin/anaconda3/envs/nerfactor/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/admin/anaconda3/envs/nerfactor/lib/python3.6/runpy.py", line 193, in _run_module_as_main (Current frame)
"__main__", mod_spec)
My dataset directory looks like this:
root
│ transforms_test.json
│ transforms_train.json
│ transforms_val.json
│
├───test_000
│ metadata.json
│ nn.png
│ rgba.png
│
├───train_000
│ albedo.png
│ metadata.json
│ rgba.png
│
├───train_001
│ metadata.json
│ rgba.png
│
├───train_002
│ metadata.json
│ rgba.png
│
├───train_003
│ metadata.json
│ rgba.png
│
├───train_004
│ metadata.json
│ rgba.png
│
├───train_005
│ metadata.json
│ rgba.png
│
├───train_006
│ metadata.json
│ rgba.png
│
├───train_007
│ metadata.json
│ rgba.png
│
├───train_008
│ metadata.json
│ rgba.png
│
├───train_009
│ metadata.json
│ rgba.png
│
└───val_000
metadata.json
rgba.png
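In the listing above, train_000 is the only folder that contains albedo.png and test_000 is the only one that contains nn.png. Whether that matters for NeRFactor is not established in this thread. As a minimal sketch for double-checking such a layout, the hypothetical helper below (`group_views_by_files` is an illustrative name, not part of the NeRFactor codebase) groups the view folders by the exact set of files they contain, so odd ones out are easy to spot:

```python
from collections import defaultdict
from pathlib import Path


def group_views_by_files(root):
    """Map each distinct set of filenames to the view folders holding exactly that set.

    Hypothetical helper for illustration; it only inspects filenames, not contents.
    """
    groups = defaultdict(list)
    view_dirs = sorted(p for p in Path(root).iterdir() if p.is_dir())
    for view_dir in view_dirs:
        # Use a sorted tuple so that folders with the same files share one key.
        names = tuple(sorted(f.name for f in view_dir.iterdir() if f.is_file()))
        groups[names].append(view_dir.name)
    return dict(groups)


def print_layout_summary(root):
    """Print one line per distinct file set, followed by the folders that have it."""
    for names, folders in group_views_by_files(root).items():
        print(", ".join(names) or "(empty)", "->", " ".join(folders))
```

Calling `print_layout_summary("root")` on the dataset root prints one line per distinct file set. A group containing a single folder is worth a second look.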
xiumingzhang commented
Sorry for the delayed response. Have you solved the problem? This issue seems to be related to specific values in your data. Please feel free to reopen this if you need further help.
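One way to start looking for problematic values is to scan the dataset's JSON files (the transforms_*.json files and each metadata.json) for NaN or infinite numbers. The sketch below is an assumption about where to look, not a confirmed diagnosis or the author's method. The function names are hypothetical, it uses only the Python standard library, and it does not inspect the images themselves:

```python
import json
import math
from pathlib import Path


def iter_nonfinite(value, path=""):
    """Yield (json_path, value) for every NaN or Inf float nested inside `value`."""
    if isinstance(value, float) and not math.isfinite(value):
        yield path, value
    elif isinstance(value, dict):
        for key, item in value.items():
            yield from iter_nonfinite(item, f"{path}/{key}")
    elif isinstance(value, list):
        for index, item in enumerate(value):
            yield from iter_nonfinite(item, f"{path}/{index}")


def scan_dataset_json(root):
    """Return {file: [(json_path, value), ...]} for JSON files containing non-finite numbers.

    Hypothetical helper for illustration; files without such values are omitted.
    """
    report = {}
    for json_file in sorted(Path(root).rglob("*.json")):
        with open(json_file) as f:
            # Python's json module accepts NaN/Infinity literals by default.
            data = json.load(f)
        hits = list(iter_nonfinite(data))
        if hits:
            report[str(json_file)] = hits
    return report
```

An empty result from `scan_dataset_json("root")` only rules out non-finite numbers in the JSON files. Other causes remain possible, such as degenerate values in the images or intermediate quantities computed during training.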
hongsiyu commented
@CorneliusHsiao I am running into the same problem. Have you solved it?