EdinburghNLP/nematus

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape

shiny1022 opened this issue · 3 comments

I use the TensorFlow-based Nematus. When I run train.sh, the GPU memory is fully occupied, even though the data set is not very large. My versions are tensorflow==1.4.0, cuda=8.0, cudnn=5.1.
I set the batch size to 64, but the problem remains. Does anyone know how to solve it?

tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[51,64,74313]
[[Node: decoder/hidden_to_logits/add = Add[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](decoder/hidden_to_logits/Reshape_1, decoder/next_word_predictor/hidden_to_logits/b/read)]]
[[Node: loss/sparse_softmax_cross_entropy_loss/assert_broadcastable/AssertGuard/Assert/Switch/_767 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_11437_loss/sparse_softmax_cross_entropy_loss/assert_broadcastable/AssertGuard/Assert/Switch", tensor_type=DT_BOOL, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op u'decoder/hidden_to_logits/add', defined at:
  File "/home/moses/NLP/workspace/mt/nematus/train.py", line 618, in <module>
    train(config, sess)
  File "/home/moses/NLP/workspace/mt/nematus/train.py", line 86, in train
    replicas.append(rnn_model.RNNModel(config))
  File "/home/moses/NLP/workspace/mt/nematus/rnn_model.py", line 68, in __init__
    self.logits = self.decoder.score(self.inputs.y)
  File "/home/moses/NLP/workspace/mt/nematus/rnn_model.py", line 229, in score
    logits = self.predictor.get_logits(y_embs, states, attended_states, multi_step=True)
  File "/home/moses/NLP/workspace/mt/nematus/rnn_model.py", line 317, in get_logits
    logits = self.hidden_to_logits.forward(hidden, input_is_3d=multi_step)
  File "/home/moses/NLP/workspace/mt/nematus/layers.py", line 83, in forward
    y = matmul3d(x, self.W) + self.b
  File "/home/moses/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/math_ops.py", line 894, in binary_op_wrapper
    return func(x, y, name=name)
  File "/home/moses/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 183, in add
    "Add", x=x, y=y, name=name)
  File "/home/moses/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/moses/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/home/moses/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

GPU memory consumption does not depend on the size of the dataset, but on the network size (number and size of hidden layers, size of embedding layer and vocabulary size), and some training parameters (maximum sentence length, batch size).
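To put rough numbers on that (a back-of-envelope estimate of my own, not something Nematus reports): the tensor in the error message has shape [51, 64, 74313], i.e. timesteps × batch size × target vocabulary size, so the logits alone are close to a gigabyte:

# Rough size of the single logits tensor that triggers the OOM above.
timesteps, batch_size, vocab_size = 51, 64, 74313   # taken from the error message
bytes_per_float32 = 4
logits_bytes = timesteps * batch_size * vocab_size * bytes_per_float32
print("logits tensor alone: %.2f GiB" % (logits_bytes / 2.0 ** 30))
# ~0.90 GiB for this one tensor; backprop additionally keeps activations for
# every layer, and the optimizer adds gradient and moment tensors per parameter.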

You can reduce these parameters to train your model. Note that this may also reduce quality, and I wouldn't recommend trying to train models with less than 8GB of GPU memory.
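For example (a sketch only; the flag names are the ones I see in the TensorFlow Nematus train.py, so please check python train.py --help for your version, and the values are illustrative), you could halve the batch size and shrink the model:

# Hypothetical reduced-footprint training call; flag names assumed from the
# TensorFlow Nematus train.py, values chosen only for illustration.
import subprocess

subprocess.check_call([
    "python", "train.py",
    "--batch_size", "40",       # down from the 80 used in the WMT17 script
    "--maxlen", "50",           # discard very long training sentences
    "--embedding_size", "256",  # smaller word embeddings
    "--state_size", "512",      # smaller recurrent hidden state
    # keep your existing data, dictionary and model-path arguments unchanged
])

Reducing the target vocabulary (e.g. using fewer BPE merge operations when preprocessing) would also shrink the 74313-wide logits dimension directly.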

Thanks for your advice, but the GPU has 11 GB of memory, which I would think is enough. The train.sh script was downloaded from the WMT 2017 setup, and its default batch_size is 80, which doesn't seem large to me. Is it possible that something is wrong elsewhere?