EnyaHermite/SPH3D-GCN

Error: CUB segmented reduce errorinvalid device function

GaHooooo opened this issue · 1 comments

Thanks for your great work!
I have a question, though:
when I train modelnet40_cls, the following error occurs:

2020-11-15 20:56:49.388664
Traceback (most recent call last):
File "/home/gahho/anaconda3/envs/sph3dgcn/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
return fn(*args)
File "/home/gahho/anaconda3/envs/sph3dgcn/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/gahho/anaconda3/envs/sph3dgcn/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InternalError: CUB segmented reduce errorinvalid device function
[[{{node Max}} = Max[T=DT_FLOAT, Tidx=DT_INT32, keep_dims=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Sum, ArgMax/dimension)]]
[[{{node GroupCrossDeviceControlEdges_0/Adam/value/_82}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_7776_GroupCrossDeviceControlEdges_0/Adam/value", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/gahho/SPH3D-GCN/modelnet40_cls/train_modelnet.py", line 376, in <module>
train()
File "/home/gahho/SPH3D-GCN/modelnet40_cls/train_modelnet.py", line 247, in train
train_one_epoch(sess, ops, next_train_element, train_writer)
File "/home/gahho/SPH3D-GCN/modelnet40_cls/train_modelnet.py", line 292, in train_one_epoch
ops['train_op'], ops['loss'], ops['pred']], feed_dict=feed_dict)
File "/home/gahho/anaconda3/envs/sph3dgcn/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/home/gahho/anaconda3/envs/sph3dgcn/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/home/gahho/anaconda3/envs/sph3dgcn/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/home/gahho/anaconda3/envs/sph3dgcn/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: CUB segmented reduce errorinvalid device function
[[node Max (defined at /home/gahho/SPH3D-GCN/models/SPH3D_modelnet.py:13) = Max[T=DT_FLOAT, Tidx=DT_INT32, keep_dims=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Sum, ArgMax/dimension)]]
[[{{node GroupCrossDeviceControlEdges_0/Adam/value/_82}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_7776_GroupCrossDeviceControlEdges_0/Adam/value", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'Max', defined at:
File "/home/gahho/SPH3D-GCN/modelnet40_cls/train_modelnet.py", line 376, in <module>
train()
File "/home/gahho/SPH3D-GCN/modelnet40_cls/train_modelnet.py", line 161, in train
pred, end_points = MODEL.get_model(xyz_pl, training_pl, config=net_config)
File "/home/gahho/SPH3D-GCN/models/SPH3D_modelnet.py", line 42, in get_model
points = normalize_xyz(points)
File "/home/gahho/SPH3D-GCN/models/SPH3D_modelnet.py", line 13, in normalize_xyz
scale = tf.reduce_max(tf.reduce_sum(tf.square(points),axis=-1,keepdims=True),axis=1,keepdims=True)
File "/home/gahho/anaconda3/envs/sph3dgcn/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/home/gahho/anaconda3/envs/sph3dgcn/lib/python3.5/site-packages/tensorflow/python/ops/math_ops.py", line 1643, in reduce_max
name=name))
File "/home/gahho/anaconda3/envs/sph3dgcn/lib/python3.5/site-packages/tensorflow/python/ops/gen_math_ops.py", line 4641, in _max
name=name)
File "/home/gahho/anaconda3/envs/sph3dgcn/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/gahho/anaconda3/envs/sph3dgcn/lib/python3.5/site-packages/tensorflow/python/util/deprecation.py", line 488, in new_func
return func(*args, **kwargs)
File "/home/gahho/anaconda3/envs/sph3dgcn/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3274, in create_op
op_def=op_def)
File "/home/gahho/anaconda3/envs/sph3dgcn/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1770, in __init__
self._traceback = tf_stack.extract_stack()

InternalError (see above for traceback): CUB segmented reduce errorinvalid device function
[[node Max (defined at /home/gahho/SPH3D-GCN/models/SPH3D_modelnet.py:13) = Max[T=DT_FLOAT, Tidx=DT_INT32, keep_dims=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Sum, ArgMax/dimension)]]
[[{{node GroupCrossDeviceControlEdges_0/Adam/value/_82}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_7776_GroupCrossDeviceControlEdges_0/Adam/value", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

When I set batch_size=1, the error does not occur, but the results are very bad.
I don't know how to fix it.
Thanks in advance for your reply.

The problem might be caused by the TensorFlow version you are using. We tested the code with TensorFlow 1.12.
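
A quick way to check this is to compare the installed TensorFlow version against 1.12 before training. The snippet below is only a minimal sketch: the helper names (`major_minor`, `matches_tested_version`, `report_installed_tensorflow`) are made up for illustration and are not part of SPH3D-GCN, and a matching version is not guaranteed to make the error go away.

```python
import re

# Release named in the reply above; everything else here is illustrative.
TESTED_TF_VERSION = (1, 12)


def major_minor(version):
    """Extract (major, minor) from a version string such as '1.12.0'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognized version string: %r" % version)
    return int(match.group(1)), int(match.group(2))


def matches_tested_version(version, tested=TESTED_TF_VERSION):
    """Return True when the major.minor part equals the tested release."""
    return major_minor(version) == tested


def report_installed_tensorflow():
    """Return a one-line report about the TensorFlow found in this environment."""
    try:
        import tensorflow as tf
    except ImportError:
        return "TensorFlow is not installed in this environment"
    if matches_tested_version(tf.__version__):
        status = "matches"
    else:
        status = "differs from"
    return "TensorFlow %s %s the tested release 1.12" % (tf.__version__, status)


if __name__ == "__main__":
    print(report_installed_tensorflow())
```

If the report shows a different release, installing the 1.12 GPU build into the `sph3dgcn` environment and rerunning `train_modelnet.py` would be the first thing to try.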