wb14123/seq2seq-couplet

NaN error after running to step 660900

fjibj opened this issue · 3 comments

fjibj commented

2019-08-06 00:52:38.476421: E tensorflow/core/kernels/check_numerics_op.cc:185] abnormal_detected_host @0x7fb3e960d900 = {0, 1} Found Inf or NaN global norm.
Traceback (most recent call last):
  File "/root/anaconda3/envs/fjpy36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/root/anaconda3/envs/fjpy36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/root/anaconda3/envs/fjpy36/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Found Inf or NaN global norm. : Tensor had Inf values
	 [[{{node VerifyFinite/CheckNumerics}} = CheckNumericsT=DT_FLOAT, message="Found Inf or NaN global norm.", _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
	 [[{{node clip_by_global_norm/mul_1/_301}} = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2818_clip_by_global_norm/mul_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Same question here. Has anyone found a fix?

I hit the same problem after reaching step 696100. Does anyone know how to fix it?

2020-02-11 16:21:58.843161: E tensorflow/core/kernels/check_numerics_op.cc:185] abnormal_detected_host @0x7f2456e15a00 = {0, 1} Found Inf or NaN global norm.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/opt/conda/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Found Inf or NaN global norm. : Tensor had Inf values
	 [[{{node VerifyFinite/CheckNumerics}} = CheckNumerics[T=DT_FLOAT, message="Found Inf or NaN global norm.", _device="/job:localhost/replica:0/task:0/device:GPU:0"](global_norm/global_norm)]]
	 [[{{node clip_by_global_norm/mul_1/_159}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2818_clip_by_global_norm/mul_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

I have occasionally run into the same problem myself. The usual workaround is to resume training from a checkpoint.
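
For anyone who wants to automate that workaround, here is a rough, framework-free sketch of the idea: save a checkpoint every so often, and when a step reports a non-finite global norm, roll back to the last checkpoint and keep going. It is not the code from this repo. `NonFiniteGradientError`, `check_finite`, `train_with_resume`, and the `train_step` / `save` / `restore` callables are all hypothetical names made up for illustration. In the TF 1.x setup shown in the traces, the exception to catch would presumably be `tf.errors.InvalidArgumentError`, and saving/restoring would go through `tf.train.Saver`.

```python
import math


class NonFiniteGradientError(Exception):
    """Hypothetical stand-in for the InvalidArgumentError in the traces above."""


def global_norm(grads):
    # Square root of the sum of squares over all gradient values.
    return math.sqrt(sum(g * g for g in grads))


def check_finite(grads):
    # Raise if the global norm is Inf or NaN, otherwise return it.
    norm = global_norm(grads)
    if not math.isfinite(norm):
        raise NonFiniteGradientError("Found Inf or NaN global norm.")
    return norm


def train_with_resume(train_step, save, restore, total_steps,
                      save_every=100, max_restarts=3):
    """Run train_step(step) until total_steps, rolling back on Inf/NaN.

    All three callables are supplied by the caller (assumptions):
    save(step) writes a checkpoint, restore() reloads the latest one
    and returns the step at which it was taken.
    """
    step, restarts = 0, 0
    while step < total_steps:
        try:
            train_step(step)
        except NonFiniteGradientError:
            restarts += 1
            if restarts > max_restarts:
                raise  # keeps failing: give up instead of looping forever
            step = restore()  # continue from the last good checkpoint
            continue
        step += 1
        if step % save_every == 0:
            save(step)
    return restarts
```

This only helps if the blow-up is sporadic, for example when the data order or dropout differs after the restart. If the same step fails every time, `max_restarts` stops the loop, and lowering the learning rate or inspecting the offending batch is probably the next thing to try.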