ScalaConsultants/Aspect-Based-Sentiment-Analysis

OOM issue during training of classifier on GPU instance

Closed this issue · 2 comments

Hi there,

I am trying to train my own models based on the provided template, and it works well on a CPU machine with a small training dataset.

However, when I enlarge the training data (>10k sentences or even >100k sentences), I receive an OOM error message. This always seems to happen on GPU instances (tried on AWS with instances up to g3.16xlarge, https://aws.amazon.com/de/ec2/instance-types/g3/ ).

Here is the relevant part of the error message:

2020-08-22 10:59:27.126292: W tensorflow/core/common_runtime/bfc_allocator.cc:434] Allocator (GPU_0_bfc) ran out of memory trying to allocate 48.00MiB (rounded to 50331648)
Current allocation summary follows.
2020-08-22 10:59:27.126394: I tensorflow/core/common_runtime/bfc_allocator.cc:934] BFCAllocator dump for GPU_0_bfc
2020-08-22 10:59:27.126408: I tensorflow/core/common_runtime/bfc_allocator.cc:941] Bin (256): Total Chunks: 56, Chunks in use: 56. 14.0KiB allocated for chunks. 14.0KiB in use in bin. 384B client-requested in use in bin.

...


  File "/usr/local/lib/python3.7/dist-packages/optuna/study.py", line 331, in optimize
    func, n_trials, timeout, catch, callbacks, gc_after_trial, None
  File "/usr/local/lib/python3.7/dist-packages/optuna/study.py", line 626, in _optimize_sequential
    self._run_trial_and_callbacks(func, catch, callbacks, gc_after_trial)
  File "/usr/local/lib/python3.7/dist-packages/optuna/study.py", line 656, in _run_trial_and_callbacks
    trial = self._run_trial(func, catch, gc_after_trial)
  File "/usr/local/lib/python3.7/dist-packages/optuna/study.py", line 677, in _run_trial
    result = func(trial)
  File "/usr/local/lib/python3.7/dist-packages/txtclassification/fine_tune_absa.py", line 236, in objective
    return experiment(local_folder_name=local_folder_name, **params)
  File "/usr/local/lib/python3.7/dist-packages/txtclassification/fine_tune_absa.py", line 160, in experiment
    test_dataset, callbacks, strategy)
  File "/usr/local/lib/python3.7/dist-packages/aspect_based_sentiment_analysis/training/classifier.py", line 60, in train_classifier
    callbacks=callbacks
  File "/usr/local/lib/python3.7/dist-packages/aspect_based_sentiment_analysis/training/routines.py", line 29, in train
    train_loop(train_step, train_dataset, callbacks, strategy)
  File "/usr/local/lib/python3.7/dist-packages/aspect_based_sentiment_analysis/training/routines.py", line 44, in train_loop
    train_step_outputs = step(tf_batch)
  File "/usr/local/lib/python3.7/dist-packages/aspect_based_sentiment_analysis/training/routines.py", line 62, in one_device
    return strategy.experimental_run_v2(step, args=batch)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/util/deprecation.py", line 324, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 957, in experimental_run_v2
    return self.run(fn, args=args, kwargs=kwargs, options=options)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/one_device_strategy.py", line 182, in run
    return super(OneDeviceStrategy, self).run(fn, args, kwargs, options)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 951, in run
    return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/distribute_lib.py", line 2290, in call_for_each_replica
    return self._call_for_each_replica(fn, args, kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/distribute/one_device_strategy.py", line 362, in _call_for_each_replica
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/autograph/impl/api.py", line 282, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/aspect_based_sentiment_analysis/training/classifier.py", line 31, in train_step
    training=True
  File "/usr/local/lib/python3.7/dist-packages/aspect_based_sentiment_analysis/models.py", line 147, in call
    **bert_kwargs
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 968, in __call__
    outputs = self.call(cast_inputs, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/modeling_tf_bert.py", line 572, in call
    encoder_outputs = self.encoder([embedding_output, extended_attention_mask, head_mask], training=training)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 968, in __call__
    outputs = self.call(cast_inputs, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/modeling_tf_bert.py", line 378, in call
    layer_outputs = layer_module([hidden_states, attention_mask, head_mask[i]], training=training)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 968, in __call__
    outputs = self.call(cast_inputs, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/modeling_tf_bert.py", line 356, in call
    intermediate_output = self.intermediate(attention_output)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 968, in __call__
    outputs = self.call(cast_inputs, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/modeling_tf_bert.py", line 322, in call
    hidden_states = self.intermediate_act_fn(hidden_states)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 968, in __call__
    outputs = self.call(cast_inputs, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/keras/layers/core.py", line 420, in call
    return self.activation(inputs)
  File "/usr/local/lib/python3.7/dist-packages/transformers/modeling_tf_bert.py", line 65, in gelu
    cdf = 0.5 * (1.0 + tf.math.erf(x / tf.math.sqrt(2.0)))
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/math_ops.py", line 1010, in r_binary_op_wrapper
    return func(x, y, name=name)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/math_ops.py", line 1276, in _add_dispatch
    return gen_math_ops.add_v2(x, y, name=name)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 480, in add_v2
    _ops.raise_from_not_ok_status(e, name)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/framework/ops.py", line 6653, in raise_from_not_ok_status
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[32,128,3072] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc [Op:AddV2]

It would be great if you could provide any suggestions on how to solve this.

Thank you and best,
Tobias

Hey Tobias,
if you run into an OOM error, you need to inspect what your training batches look like ☺️

  1. reduce the batch size
  2. reduce the input sequence length (if texts are too long, use a text splitter; take a look at how the Pipeline works)
  3. distribute the training (across a few GPUs)

I think you should check the training dataset and remove/trim outliers (in terms of the input sequence length). In other words, your training dataset probably contains texts that are too long. This is a common problem when processing sequences: a batch has to be padded to the length of its longest example, so allocating the tensor becomes a problem even if it consists mostly of zeros. In your case the failed allocation is exactly such a padded tensor: 32 examples × 128 tokens × a 3072-wide intermediate layer × 4 bytes per float = 48 MiB, which matches the OOM message above.
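A quick way to check this is to look at the token-length distribution of your dataset and drop (or truncate) the outliers before training. A minimal sketch, assuming your examples are plain strings in a list called `texts` and using a BERT tokenizer from the transformers package (the variable names and the 128-token limit are placeholders, not part of this library's API):

```python
from collections import Counter
from transformers import BertTokenizer

# Hypothetical example data -- replace with your own training texts.
texts = [
    "The food was great but the service was terrible.",
    "Battery life is amazing, but the screen could be brighter.",
]

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# 1. Inspect the token-length distribution to spot outliers.
lengths = [len(tokenizer.tokenize(text)) for text in texts]
print("max:", max(lengths), "| histogram:", Counter(l // 10 * 10 for l in lengths))

# 2. Drop (or alternatively truncate) examples longer than a chosen limit.
MAX_TOKENS = 128  # placeholder limit; pick what fits your GPU memory
filtered = [t for t, l in zip(texts, lengths) if l <= MAX_TOKENS]
print(f"kept {len(filtered)} of {len(texts)} examples")
```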

This is just my quick intuition. If it does not help, you would need to provide more details ☺️

Thanks!
I had already tried reducing the batch size, but that did not change much. The sequence length was a good intuition ;) Capping it at a maximum of 30 tokens allows much larger datasets.
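For anyone landing here later, here is a minimal sketch of what such a cap can look like when pre-tokenizing the data yourself; the 30-token limit and the example text are only illustrative, and the exact keyword arguments assume a reasonably recent transformers release:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Illustrative sentence; replace with your own training text.
text = "The pasta was wonderful although the waiter kept forgetting our drinks."

# Truncate every example to at most 30 tokens before it is batched,
# so padded batches stay small (batch_size x 30 instead of batch_size x 128+).
encoded = tokenizer.encode_plus(
    text,
    max_length=30,
    truncation=True,       # transformers >= 3.0; older versions used truncation_strategy
    padding="max_length",   # pad shorter examples up to the same fixed length
)
print(len(encoded["input_ids"]))  # -> 30
```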

Best,
Tobias