ScalaConsultants/Aspect-Based-Sentiment-Analysis

OOM issue during training of classifier on GPU instance

Closed this issue · 2 comments

Hi there,

I am trying to train my own models based on the provided template, and it works well on a CPU machine with a small training dataset.

However, when I enlarge the training data (>10k sentences or even >100k sentences), I receive an OOM error message. This always seems to happen on GPU instances (tried on AWS with instances up to g3.16xlarge, https://aws.amazon.com/de/ec2/instance-types/g3/ ).

Here is the relevant part of the error message:

2020-08-22 10:59:27.126292: W tensorflow/core/common_runtime/bfc_allocator.cc:434] Allocator (GPU_0_bfc) ran out of memory trying to allocate 48.00MiB (rounded to 50331648)
Current allocation summary follows.
2020-08-22 10:59:27.126394: I tensorflow/core/common_runtime/bfc_allocator.cc:934] BFCAllocator dump for GPU_0_bfc
2020-08-22 10:59:27.126408: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (256): Total Chunks: 56, Chunks in use: 56. 14.0KiB allocated for chunks. 14.0KiB in use in bin. 384B client-requested in use in bin.

...


  File "/usr/local/lib/python3.7/dist-packages/optuna/study.py", line 331, in optimize
    func, n_trials, timeout, catch, callbacks, gc_after_trial, None
  File "/usr/local/lib/python3.7/dist-packages/optuna/study.py", line 626, in _optimize_sequential
    self._run_trial_and_callbacks(func, catch, callbacks, gc_after_trial)
  File "/usr/local/lib/python3.7/dist-packages/optuna/study.py", line 656, in _run_trial_and_callbacks
    trial = self._run_trial(func, catch, gc_after_trial)
  File "/usr/local/lib/python3.7/dist-packages/optuna/study.py", line 677, in _run_trial
    result = func(trial)
  File "/usr/local/lib/python3.7/dist-packages/txtclassification/fine_tune_absa.py", line 236, in objective
    return experiment(local_folder_name=local_folder_name, **params)
  File "/usr/local/lib/python3.7/dist-packages/txtclassification/fine_tune_absa.py", line 160, in experiment
    test_dataset, callbacks, strategy)
  File "/usr/local/lib/python3.7/dist-packages/aspect_based_sentiment_analysis/training/classifier.py", line 60, in train_classifier
    callbacks=callbacks
  File "/usr/local/lib/python3.7/dist-packages/aspect_based_sentiment_analysis/training/routines.py", line 29, in train
    train_loop(train_step, train_dataset, callbacks, strategy)
  File "/usr/local/lib/python3.7/dist-packages/aspect_based_sentiment_analysis/training/routines.py", line 44, in train_loop
    train_step_outputs = step(tf_batch)
  File "/usr/local/lib/python3.7/dist-packages/aspect_based_sentiment_analysis/training/routines.py", line 62, in one_device
    return strategy.experimental_run_v2(step, args=batch)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/util/deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 957, in experimental_run_v2
    return self.run(fn, args=args, kwargs=kwargs, options=options)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/one_device_strategy.py", line 182, in run
    return super(OneDeviceStrategy, self).run(fn, args, kwargs, options)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 951, in run
    return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 2290, in call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/one_device_strategy.py", line 362, in _call_for_each_replica
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/autograph/impl/api.py", line 282, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/aspect_based_sentiment_analysis/training/classifier.py", line 31, in train_step
    training=True
  File "/usr/local/lib/python3.7/dist-packages/aspect_based_sentiment_analysis/models.py", line 147, in call
    **bert_kwargs
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 968, in __call__
    outputs = self.call(cast_inputs, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/modeling_tf_bert.py", line 572, in call
    encoder_outputs = self.encoder([embedding_output, extended_attention_mask, head_mask], training=training)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 968, in __call__
    outputs = self.call(cast_inputs, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/modeling_tf_bert.py", line 378, in call
    layer_outputs = layer_module([hidden_states, attention_mask, head_mask[i]], training=training)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 968, in __call__
    outputs = self.call(cast_inputs, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/modeling_tf_bert.py", line 356, in call
    intermediate_output = self.intermediate(attention_output)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 968, in __call__
    outputs = self.call(cast_inputs, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/modeling_tf_bert.py", line 322, in call
    hidden_states = self.intermediate_act_fn(hidden_states)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 968, in __call__
    outputs = self.call(cast_inputs, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/layers/core.py", line 420, in call
    return self.activation(inputs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/modeling_tf_bert.py", line 65, in gelu
    cdf = 0.5 * (1.0 + tf.math.erf(x / tf.math.sqrt(2.0)))
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/math_ops.py", line 1010, in r_binary_op_wrapper
    return func(x, y, name=name)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/math_ops.py", line 1276, in _add_dispatch
    return gen_math_ops.add_v2(x, y, name=name)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 480, in add_v2
    _ops.raise_from_not_ok_status(e, name)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py", line 6653, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[32,128,3072] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:AddV2]

It would be great if you could provide any suggestions on how to solve this.

Thank you and best,
Tobias

Hey Tobias,
if you run into an OOM error, you need to inspect what your training batches look like ☺️

  1. reduce the batch size
  2. reduce the input sequence length (if texts are too long, use a text splitter; take a look at how the Pipeline works)
  3. distribute the training (across a few GPUs)

I think you should check the training dataset and remove/trim outliers (in terms of the input sequence length). In other words, your training dataset probably contains texts that are too long. This is a common problem when processing sequences: a batch has to be padded to the length of its longest example, so allocating the tensor becomes a problem even if it consists mostly of zeros. In your case the failed allocation is exactly such a padded tensor: 32 examples × 128 tokens × a 3072-wide intermediate layer × 4 bytes per float = 48 MiB, which matches the OOM message above.
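A quick way to check this is to look at the token-length distribution of your dataset and drop (or truncate) the outliers before training. A minimal sketch, assuming your examples are plain strings in a list called `texts` and using a BERT tokenizer from the transformers package (the variable names and the 128-token limit are placeholders, not part of this library's API):

```python
from collections import Counter
from transformers import BertTokenizer

# Hypothetical example data -- replace with your own training texts.
texts = [
    "The food was great but the service was terrible.",
    "Battery life is amazing, but the screen could be brighter.",
]

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# 1. Inspect the token-length distribution to spot outliers.
lengths = [len(tokenizer.tokenize(text)) for text in texts]
print("max:", max(lengths), "| histogram:", Counter(l // 10 * 10 for l in lengths))

# 2. Drop (or alternatively truncate) examples longer than a chosen limit.
MAX_TOKENS = 128  # placeholder limit; pick what fits your GPU memory
filtered = [t for t, l in zip(texts, lengths) if l <= MAX_TOKENS]
print(f"kept {len(filtered)} of {len(texts)} examples")
```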

This is just my quick intuition. If it does not help, you would need to provide more details ☺️

Thanks!
I had already tried reducing the batch size, but that did not change much. The sequence length was a good intuition ;) Capping it at a maximum of 30 tokens allows much larger datasets.
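For anyone landing here later, here is a minimal sketch of what such a cap can look like when pre-tokenizing the data yourself; the 30-token limit and the example text are only illustrative, and the exact keyword arguments assume a reasonably recent transformers release:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Illustrative sentence; replace with your own training text.
text = "The pasta was wonderful although the waiter kept forgetting our drinks."

# Truncate every example to at most 30 tokens before it is batched,
# so padded batches stay small (batch_size x 30 instead of batch_size x 128+).
encoded = tokenizer.encode_plus(
    text,
    max_length=30,
    truncation=True,       # transformers >= 3.0; older versions used truncation_strategy
    padding="max_length",   # pad shorter examples up to the same fixed length
)
print(len(encoded["input_ids"]))  # -> 30
```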

Best,
Tobias