EdinburghNLP/nematus

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape

shiny1022 opened this issue · 3 comments

I use the TensorFlow-based Nematus. When I run train.sh, the GPU memory is fully occupied, even though the data set is not very large. My versions are tensorflow==1.4.0, cuda=8.0, cudnn=5.1.
I set the batch size to 64, but the problem remains. Does anyone know how to solve it?

tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[51,64,74313]
[[Node: decoder/hidden_to_logits/add = Add[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:GPU:0"](decoder/hidden_to_logits/Reshape_1, decoder/next_word_predictor/hidden_to_logits/b/read)]]
[[Node: loss/sparse_softmax_cross_entropy_loss/assert_broadcastable/AssertGuard/Assert/Switch/_767 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_11437_loss/sparse_softmax_cross_entropy_loss/assert_broadcastable/AssertGuard/Assert/Switch", tensor_type=DT_BOOL, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op u'decoder/hidden_to_logits/add', defined at:
  File "/home/moses/NLP/workspace/mt/nematus/train.py", line 618, in <module>
    train(config, sess)
  File "/home/moses/NLP/workspace/mt/nematus/train.py", line 86, in train
    replicas.append(rnn_model.RNNModel(config))
  File "/home/moses/NLP/workspace/mt/nematus/rnn_model.py", line 68, in __init__
    self.logits = self.decoder.score(self.inputs.y)
  File "/home/moses/NLP/workspace/mt/nematus/rnn_model.py", line 229, in score
    logits = self.predictor.get_logits(y_embs, states, attended_states, multi_step=True)
  File "/home/moses/NLP/workspace/mt/nematus/rnn_model.py", line 317, in get_logits
    logits = self.hidden_to_logits.forward(hidden, input_is_3d=multi_step)
  File "/home/moses/NLP/workspace/mt/nematus/layers.py", line 83, in forward
    y = matmul3d(x, self.W) + self.b
  File "/home/moses/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/math_ops.py", line 894, in binary_op_wrapper
    return func(x, y, name=name)
  File "/home/moses/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 183, in add
    "Add", x=x, y=y, name=name)
  File "/home/moses/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/moses/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/home/moses/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

GPU memory consumption does not depend on the size of the dataset, but on the network size (number and size of hidden layers, size of embedding layer and vocabulary size), and some training parameters (maximum sentence length, batch size).
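To put rough numbers on that (a back-of-envelope estimate of my own, not something Nematus reports): the tensor in the error message has shape [51, 64, 74313], i.e. timesteps × batch size × target vocabulary size, so the logits alone are close to a gigabyte:

# Rough size of the single logits tensor that triggers the OOM above.
timesteps, batch_size, vocab_size = 51, 64, 74313   # taken from the error message
bytes_per_float32 = 4
logits_bytes = timesteps * batch_size * vocab_size * bytes_per_float32
print("logits tensor alone: %.2f GiB" % (logits_bytes / 2.0 ** 30))
# ~0.90 GiB for this one tensor; backprop additionally keeps activations for
# every layer, and the optimizer adds gradient and moment tensors per parameter.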

You can reduce these parameters to train your model. Note that this may also reduce quality, and I wouldn't recommend trying to train models with less than 8GB of GPU memory.
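For example (a sketch only; the flag names are the ones I see in the TensorFlow Nematus train.py, so please check python train.py --help for your version, and the values are illustrative), you could halve the batch size and shrink the model:

# Hypothetical reduced-footprint training call; flag names assumed from the
# TensorFlow Nematus train.py, values chosen only for illustration.
import subprocess

subprocess.check_call([
    "python", "train.py",
    "--batch_size", "40",       # down from the 80 used in the WMT17 script
    "--maxlen", "50",           # discard very long training sentences
    "--embedding_size", "256",  # smaller word embeddings
    "--state_size", "512",      # smaller recurrent hidden state
    # keep your existing data, dictionary and model-path arguments unchanged
])

Reducing the target vocabulary (e.g. using fewer BPE merge operations when preprocessing) would also shrink the 74313-wide logits dimension directly.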

Thanks for your advice, but the GPU has 11 GB of memory, which I would think is enough. The train.sh script was downloaded from the WMT 2017 setup, and its default batch_size is 80, which doesn't seem large to me. Is it possible that something is wrong elsewhere?