Found Inf or NaN global norm. : Tensor had NaN values

Question

Closed this issue 5 years ago · 1 comments

ub16c9@ub16c9-gpu:/media/ub16c9/fcd84300-9270-4bbd-896a-5e04e79203b7/ub16_prj/text_matching$ PYTHONPATH=. python3.6 abcnn/train.py
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.791 seconds.
Prefix dict has been built succesfully.
{'1': 138574, '0': 100192}
WARNING:tensorflow:From /media/ub16c9/fcd84300-9270-4bbd-896a-5e04e79203b7/ub16_prj/text_matching/data_prepare.py:54: VocabularyProcessor.__init__ (from tensorflow.contrib.learn.python.learn.preprocessing.text) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tensorflow/transform or tf.data.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/contrib/learn/python/learn/preprocessing/text.py:154: CategoricalVocabulary.__init__ (from tensorflow.contrib.learn.python.learn.preprocessing.categorical_vocabulary) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tensorflow/transform or tf.data.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/contrib/learn/python/learn/preprocessing/text.py:170: tokenizer (from tensorflow.contrib.learn.python.learn.preprocessing.text) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tensorflow/transform or tf.data.
{'1': 4402, '0': 4400}
WARNING:tensorflow:From /media/ub16c9/fcd84300-9270-4bbd-896a-5e04e79203b7/ub16_prj/text_matching/abcnn/abcnn_mdoel.py:99: calling reduce_sum (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
2019-06-21 12:59:01.301229: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-06-21 12:59:01.402848: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:964] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-06-21 12:59:01.403378: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.6575
pciBusID: 0000:01:00.0
totalMemory: 10.92GiB freeMemory: 8.62GiB
2019-06-21 12:59:01.403403: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-06-21 12:59:01.801177: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-21 12:59:01.801215: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-06-21 12:59:01.801223: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-06-21 12:59:01.801521: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8324 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:01:00.0, compute capability: 6.1)
training 1>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
0it [00:00, ?it/s]2019-06-21 12:59:04.872182: E tensorflow/core/kernels/check_numerics_op.cc:185] abnormal_detected_host @0x7f5822809400 = {1, 0} Found Inf or NaN global norm.

Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Found Inf or NaN global norm. : Tensor had NaN values
[[{{node model/VerifyFinite/CheckNumerics}} = CheckNumericsT=DT_FLOAT, message="Found Inf or NaN global norm.", _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
[[{{node model/clip_by_global_norm/mul_1/_359}} = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_5200_model/clip_by_global_norm/mul_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "abcnn/train.py", line 122, in <module>
train.trainModel()
File "abcnn/train.py", line 77, in trainModel
_, cost, accuracy = sess.run([model.train_op, model.loss, model.accuracy], feed_dict)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Found Inf or NaN global norm. : Tensor had NaN values
[[node model/VerifyFinite/CheckNumerics (defined at /media/ub16c9/fcd84300-9270-4bbd-896a-5e04e79203b7/ub16_prj/text_matching/abcnn/abcnn_mdoel.py:226) = CheckNumericsT=DT_FLOAT, message="Found Inf or NaN global norm.", _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
[[{{node model/clip_by_global_norm/mul_1/_359}} = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_5200_model/clip_by_global_norm/mul_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

Caused by op 'model/VerifyFinite/CheckNumerics', defined at:
File "abcnn/train.py", line 122, in <module>
train.trainModel()
File "abcnn/train.py", line 56, in trainModel
d0=con.embedding_size, di=50, num_classes=2, num_layers=2)
File "/media/ub16c9/fcd84300-9270-4bbd-896a-5e04e79203b7/ub16_prj/text_matching/abcnn/abcnn_mdoel.py", line 226, in __init__
grads, _ = tf.clip_by_global_norm(tf.gradients(self.loss, tvars), 5)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/clip_ops.py", line 265, in clip_by_global_norm
"Found Inf or NaN global norm.")
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/numerics.py", line 47, in verify_tensor_all_finite
verify_input = array_ops.check_numerics(t, message=msg)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 817, in check_numerics
"CheckNumerics", tensor=tensor, message=message, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()

InvalidArgumentError (see above for traceback): Found Inf or NaN global norm. : Tensor had NaN values
[[node model/VerifyFinite/CheckNumerics (defined at /media/ub16c9/fcd84300-9270-4bbd-896a-5e04e79203b7/ub16_prj/text_matching/abcnn/abcnn_mdoel.py:226) = CheckNumericsT=DT_FLOAT, message="Found Inf or NaN global norm.", _device="/job:localhost/replica:0/task:0/device:GPU:0"]]
[[{{node model/clip_by_global_norm/mul_1/_359}} = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_5200_model/clip_by_global_norm/mul_1", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]]

ub16c9@ub16c9-gpu:/media/ub16c9/fcd84300-9270-4bbd-896a-5e04e79203b7/ub16_prj/text_matching$
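
For context, the traceback shows the check failing inside tf.clip_by_global_norm (abcnn_mdoel.py, line 226), which means the combined norm of all gradients was already Inf or NaN on the first training step. The following is only a rough pure-Python sketch of that computation, not TensorFlow's implementation and not a diagnosis of this model; the helper names global_norm, clip_by_global_norm and non_finite_indices are hypothetical and gradients are plain lists of floats for illustration.

```python
import math


def global_norm(grads):
    """L2 norm over all gradient values taken together."""
    return math.sqrt(sum(x * x for g in grads for x in g))


def clip_by_global_norm(grads, clip_norm):
    """Scale every gradient by clip_norm / max(global_norm, clip_norm).

    Raises ValueError when the joint norm is Inf or NaN, which is the
    situation reported in the log above.
    """
    norm = global_norm(grads)
    if not math.isfinite(norm):
        raise ValueError("Found Inf or NaN global norm.")
    scale = clip_norm / max(norm, clip_norm)
    return [[x * scale for x in g] for g in grads], norm


def non_finite_indices(grads):
    """Indices of the gradients that contain at least one Inf or NaN."""
    return [i for i, g in enumerate(grads)
            if not all(math.isfinite(x) for x in g)]


# A single NaN entry is enough to make the whole norm NaN.
grads = [[3.0, 4.0], [0.0, float("nan")]]
print(non_finite_indices(grads))  # [1]
```

A single non-finite value in any one gradient makes the joint norm non-finite, so finding which variable's gradient is affected (here via the hypothetical non_finite_indices) may help narrow down which part of the loss produces it.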

Answer 1 · 2019-06-26T11:20:19.000Z

Thank you, I will check it as soon as possible.