jackroos/VL-BERT

RuntimeError: CUDA out of memory

liulijie-2020 opened this issue · 2 comments

After fine-tuning the VCR model for nearly a day, this error happened:
```
Traceback (most recent call last):
  File "vcr/train_end2end.py", line 59, in <module>
    main()
  File "vcr/train_end2end.py", line 53, in main
    rank, model = train_net(args, config)
  File "/home/songzijie/project/VLbert/VL-BERT-master/vcr/../vcr/function/train.py", line 337, in train_net
    gradient_accumulate_steps=config.TRAIN.GRAD_ACCUMULATE_STEPS)
  File "/home/songzijie/project/VLbert/VL-BERT-master/vcr/../common/trainer.py", line 115, in train
    outputs, loss = net(*batch)
  File "/home/songzijie/.conda/envs/vl-bert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/songzijie/.conda/envs/vl-bert/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 376, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/songzijie/.conda/envs/vl-bert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/songzijie/project/VLbert/VL-BERT-master/vcr/../common/module.py", line 22, in forward
    return self.train_forward(*inputs, **kwargs)
  File "/home/songzijie/project/VLbert/VL-BERT-master/vcr/../vcr/modules/resnet_vlbert_for_vcr.py", line 340, in train_forward
    output_text_and_object_separately=True)
  File "/home/songzijie/.conda/envs/vl-bert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/songzijie/project/VLbert/VL-BERT-master/vcr/../common/nlp/time_distributed.py", line 35, in forward
    reshaped_outputs = self._module(*reshaped_inputs, **kwargs)
  File "/home/songzijie/.conda/envs/vl-bert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/songzijie/project/VLbert/VL-BERT-master/vcr/../common/visual_linguistic_bert.py", line 140, in forward
    output_attention_probs=output_attention_probs)
  File "/home/songzijie/.conda/envs/vl-bert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/songzijie/project/VLbert/VL-BERT-master/vcr/../external/pytorch_pretrained_bert/modeling.py", line 410, in forward
    hidden_states = layer_module(hidden_states, attention_mask, output_attention_probs=output_attention_probs)
  File "/home/songzijie/.conda/envs/vl-bert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/songzijie/project/VLbert/VL-BERT-master/vcr/../external/pytorch_pretrained_bert/modeling.py", line 392, in forward
    intermediate_output = self.intermediate(attention_output)
  File "/home/songzijie/.conda/envs/vl-bert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/songzijie/project/VLbert/VL-BERT-master/vcr/../external/pytorch_pretrained_bert/modeling.py", line 362, in forward
    hidden_states = self.dense(hidden_states)
  File "/home/songzijie/.conda/envs/vl-bert/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/songzijie/.conda/envs/vl-bert/lib/python3.6/site-packages/torch/nn/modules/linear.py", line 92, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/songzijie/.conda/envs/vl-bert/lib/python3.6/site-packages/torch/nn/functional.py", line 1408, in linear
    output = input.matmul(weight.t())
RuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 1; 11.91 GiB total capacity; 10.43 GiB already allocated; 12.06 MiB free; 348.09 MiB cached)
```
I tried training with 2 or 3 GPUs.
I also tried to reduce the batch by changing LOG_FREQUENT from 100 to 2, but it didn't help.
The error still happens after about a day of training.
I hope I can get some help with this.

LOG_FREQUENT is just the logging frequency; you need to reduce the batch size by changing the option "BATCH_IMAGES".
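
For reference, a rough sketch of the trade-off (illustration only, not VL-BERT code): the effective batch size is the per-GPU BATCH_IMAGES times the number of GPUs times TRAIN.GRAD_ACCUMULATE_STEPS, so a smaller BATCH_IMAGES can be offset with more gradient-accumulation steps if you want to keep optimization behaviour roughly the same while fitting in memory.

```python
# Illustration only, not VL-BERT code: how per-GPU batch size, GPU count and
# gradient accumulation combine into the effective batch size seen by the optimizer.
def effective_batch_size(batch_images, num_gpus, grad_accumulate_steps):
    return batch_images * num_gpus * grad_accumulate_steps

# e.g. BATCH_IMAGES=4 on 2 GPUs without accumulation ...
print(effective_batch_size(4, 2, 1))  # 8
# ... matches BATCH_IMAGES=2 on 2 GPUs with 2 accumulation steps,
# while only half as many images per GPU are resident at once.
print(effective_batch_size(2, 2, 2))  # 8
```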

Thank you very much, it worked.
I have changed "BATCH_IMAGES" to 2 and used 4 GPUs.
But during training, the memory usage on GPU 0 still rose from 7692MiB / 12212MiB to 10082MiB / 12210MiB over several hours.
I'm worried it will run out of memory again.

Unfortunately, what I was worried about has happened: GPU 0 memory usage rose to 11362MiB / 12210MiB after another 3 hours.
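
In case it helps to diagnose the slow climb, here is a small monitoring sketch using standard PyTorch calls (the helper name and logging format are my own, not part of VL-BERT). It separates memory actually allocated to tensors from the caching allocator's pool, which nvidia-smi also counts but which torch.cuda.empty_cache() can release.

```python
import torch

# Hypothetical helper, not part of VL-BERT: log allocator statistics per GPU so a
# steady rise in allocated memory (tensors kept alive across iterations) can be told
# apart from growth of PyTorch's caching allocator, which shows up in nvidia-smi but
# can be released with torch.cuda.empty_cache().
def log_gpu_memory(step):
    for device in range(torch.cuda.device_count()):
        allocated_mib = torch.cuda.memory_allocated(device) / 1024 ** 2
        cached_mib = torch.cuda.memory_cached(device) / 1024 ** 2  # memory_reserved() on newer PyTorch
        print("step %d GPU %d: %.0f MiB allocated, %.0f MiB cached"
              % (step, device, allocated_mib, cached_mib))
```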