nouhadziri/THRED

NaN error when clipping gradients

Hi,
The Vanilla Seq2Seq and HRED models report a "NaN tensor error" at the first training step.

The line that fails is clipped_grads, grad_norm = tf.clip_by_global_norm(self.gradients, params.max_gradient_norm) in hred_model.py.

How can I solve this problem?

P.S.

  • embedding: random300
  • tensorflow-gpu: 1.12.1
  • 3-turn dataset
  • THRED and TA-Seq2Seq work well

The full traceback:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/data/HRED/thred/main.py", line 6, in <module>
    tf.app.run(main=thred_main)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/data/HRED/thred/main.py", line 45, in main
    model.train()
  File "/data/HRED/thred/models/hierarchical_base.py", line 132, in train
    step_result = loaded_train_model.train(train_sess)
  File "/data/HRED/thred/models/hred/hred_model.py", line 446, in train
    self.learning_rate])
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Found Inf or NaN global norm. : Tensor had NaN values
[[node hred_graph/VerifyFinite/CheckNumerics (defined at /data/HRED/thred/models/hred/hred_model.py:131) = CheckNumericsT=DT_FLOAT, _class=["loc:@hred_graph/VerifyFinite/control_dependency"], message="Found Inf or NaN global norm.", _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
[[{{node hred_graph/clip_by_global_norm/mul/_187}} = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_3642_hred_graph/clip_by_global_norm/mul", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
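
For context on the "Found Inf or NaN global norm" message: tf.clip_by_global_norm computes one norm over all gradients together, so a single NaN in any one gradient makes the whole norm NaN and trips the check. The plain-Python sketch below only illustrates that behaviour and one way to narrow down the offending variable. It is not THRED or TensorFlow code; the helper names (global_norm, find_non_finite) and the gradient values are hypothetical.

```python
import math


def global_norm(grads):
    """Square root of the sum of squares over every element of every gradient."""
    return math.sqrt(sum(g * g for grad in grads for g in grad))


def find_non_finite(named_grads):
    """Return the names of gradients that contain a NaN or Inf element."""
    return [
        name
        for name, grad in named_grads.items()
        if any(not math.isfinite(g) for g in grad)
    ]


# Hypothetical gradients, flattened to lists of floats for illustration only.
named_grads = {
    "encoder/kernel": [0.5, -1.5],
    "decoder/kernel": [float("nan"), 2.0],  # one bad element
    "output/bias": [0.1],
}

norm = global_norm(named_grads.values())
bad = find_non_finite(named_grads)

print("global norm is NaN:", math.isnan(norm))
print("gradients with non-finite values:", bad)
```

In a TensorFlow 1.x graph the same idea could be tried by wrapping each gradient in tf.check_numerics with a message naming its variable before the clipping op, so that the error points at a specific tensor instead of the combined norm. Whether the NaN originates in the loss or in a particular layer here is not something the traceback alone shows.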