mvoelk/ssd_detectors

having problems with OOM

mustardlove opened this issue · 5 comments

Hello, Mr. Volk,
Thank you very much for your nice code!
I have a question for you.

I'm new to deep learning, have only a basic understanding of Keras code, and am currently trying to run your DSOD_train.py.
The problem is, I keep getting OOM errors while executing the "Train" section of the code (error message below).

I tried using only one of my two GPUs, and also the 'allow_growth' option in TensorFlow, but neither worked.
I believe I need to reduce the minibatch size (I guess your code uses a batch size of 128, am I right?), but I have no idea where to find the code to make this change. (Just changing batch_size = 26 to a lower number didn't solve the problem, and searching your .py files left me with no clue.)
I'd really appreciate your help with this problem.

By the way, I'm using Ubuntu 16.04 and the latest TensorFlow and Keras.
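
For reference, the allow_growth setting I tried looks roughly like this (just a sketch for the TF 1.x Keras backend; the exact session setup may differ):

import tensorflow as tf
from keras import backend as K

# restrict TensorFlow to the first GPU and let it allocate memory on demand
config = tf.ConfigProto()
config.gpu_options.visible_device_list = '0'
config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))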

--- error message ---

ResourceExhaustedError Traceback (most recent call last)
in
49 workers=1,
50 #use_multiprocessing=False,
---> 51 initial_epoch=initial_epoch)

/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py in wrapper(*args, **kwargs)
89 warnings.warn('Update your ' + object_name + ' call to the ' +
90 'Keras 2 API: ' + signature, stacklevel=2)
---> 91 return func(*args, **kwargs)
92 wrapper._original_function = func
93 return wrapper

/usr/local/lib/python3.6/dist-packages/keras/engine/training.py in fit_generator(self, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
1416 use_multiprocessing=use_multiprocessing,
1417 shuffle=shuffle,
-> 1418 initial_epoch=initial_epoch)
1419
1420 @interfaces.legacy_generator_methods_support

/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py in fit_generator(model, generator, steps_per_epoch, epochs, verbose, callbacks, validation_data, validation_steps, class_weight, max_queue_size, workers, use_multiprocessing, shuffle, initial_epoch)
215 outs = model.train_on_batch(x, y,
216 sample_weight=sample_weight,
--> 217 class_weight=class_weight)
218
219 outs = to_list(outs)

/usr/local/lib/python3.6/dist-packages/keras/engine/training.py in train_on_batch(self, x, y, sample_weight, class_weight)
1215 ins = x + y + sample_weights
1216 self._make_train_function()
-> 1217 outputs = self.train_function(ins)
1218 return unpack_singleton(outputs)
1219

/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py in __call__(self, inputs)
2713 return self._legacy_call(inputs)
2714
-> 2715 return self._call(inputs)
2716 else:
2717 if py_any(is_tensor(x) for x in inputs):

/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py in _call(self, inputs)
2673 fetched = self._callable_fn(*array_vals, run_metadata=self.run_metadata)
2674 else:
-> 2675 fetched = self._callable_fn(*array_vals)
2676 return fetched[:len(self.outputs)]
2677

/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in __call__(self, *args, **kwargs)
1456 ret = tf_session.TF_SessionRunCallable(self._session._session,
1457 self._handle, args,
-> 1458 run_metadata_ptr)
1459 if run_metadata:
1460 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)

ResourceExhaustedError: 2 root error(s) found.
(0) Resource exhausted: OOM when allocating tensor with shape[6,1376,32,32] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node batch_normalization_302/FusedBatchNorm}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

 [[loss_5/mul/_21899]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

(1) Resource exhausted: OOM when allocating tensor with shape[6,1376,32,32] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
[[{{node batch_normalization_302/FusedBatchNorm}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

0 successful operations.
0 derived errors ignored.

In DSOD_train.ipynb, the batch size is actually 6 and the gradients get accumulated with AdamAccumulate for 128//6 batches before a gradient update is performed. This results in a virtual batch size of 126, but the log is updated after each batch.

Setting the batch size to 4 or even 2 should solve the issue. How much GPU memory do you have?
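
As a rough sketch of the arithmetic (illustrative names, not the exact notebook code):

batch_size = 6
accum_iters = 128 // batch_size                 # 21 accumulation steps
virtual_batch_size = batch_size * accum_iters   # 126 samples per weight update
# reducing batch_size (e.g. to 4 or 2) lowers peak GPU memory; accum_iters can be
# increased accordingly to keep the virtual batch size roughly the same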

Thank you so much for your kind help!
I changed the batch size of the 512 model to 4 and the training code is running!

I'm using two Titan Xp GPUs and the memory spec is as follows:
Memory Speed: 11.4 Gbps
Standard Memory Config: 12 GB GDDR5X
Memory Interface Width: 384-bit
Memory Bandwidth: 547.7 GB/sec

Currently the execution is using only one GPU... I don't know why.

I have one more question!

In your data_coco.py, there is a convert_to_voc function.
I'm only using the COCO dataset, so in DSOD_train I commented out the code related to the VOC dataset and did
gt_util_train = gt_util_coco.convert_to_voc()
gt_util_val = gt_util_coco_val.convert_to_voc()

Does this make DSOD_train train on only 21 categories? I figured you only have 21 initial weights.

I've always used 1 GPU for training a model, but it should work with multiple GPUs as well. The documentation of Model.fit_generator() explains how to do this.

convert_to_voc in the COCO case returns a new GTUtility with the COCO data, but with the 20 VOC classes (21 including background), leading to a model with 21 categories.

The weights you mentioned are not trainable parameters... See #14 for more details.
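
A quick sanity check is to print the class list of the converted GTUtility (a sketch assuming the attributes are named classes and num_classes, which may differ in your version):

gt_util_train = gt_util_coco.convert_to_voc()
print(gt_util_train.classes)      # should list the background class plus the 20 VOC classes
print(gt_util_train.num_classes)  # should be 21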

Thank you for the reply!

I played with some parameters in fit_generator() (use_multiprocessing=True, workers=2), but still only one GPU was active.

I also tried using multi_gpu_model from keras.utils, but it failed with _TfDeviceCaptureOp does not have method _set_device_from_string.
I found that the class _TfDeviceCaptureOp in tensorflow/python/keras/backend.py does have _set_device_from_string, but the one in keras/backend/tensorflow_backend.py does not.

If anyone has solved this issue, please share your knowledge.
Thank you!

Search for keras.utils.multi_gpu_model. The use_multiprocessing=True and workers=2 arguments only refer to data loading, not to multi-GPU training.
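
For reference, the usual multi_gpu_model pattern looks roughly like this (a sketch only; model, optim and loss stand in for the objects built earlier in the notebook, and it requires matching Keras and TensorFlow versions, which is what the _TfDeviceCaptureOp error above points to):

from keras.utils import multi_gpu_model

# wrap the single-GPU model; each incoming batch is split across the two GPUs
parallel_model = multi_gpu_model(model, gpus=2)
parallel_model.compile(optimizer=optim, loss=loss)
# then call parallel_model.fit_generator(...) instead of model.fit_generator(...)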