tamarott/SinGAN

RuntimeError: CUDA out of memory

Opened this issue · 8 comments

First of all, thank you for your wonderful work.
I am training with animation.py, and after scale 7 I get the error below. How can I solve it? Thanks!

scale 7:[1975/2000]
scale 7:[1999/2000]
GeneratorConcatSkip2CleanAdd(
  (head): ConvBlock(
    (conv): Conv2d(3, 128, kernel_size=(3, 3), stride=(1, 1))
    (norm): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (LeakyRelu): LeakyReLU(negative_slope=0.2, inplace=True)
  )
  (body): Sequential(
    (block1): ConvBlock(
      (conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1))
      (norm): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (LeakyRelu): LeakyReLU(negative_slope=0.2, inplace=True)
    )
    (block2): ConvBlock(
      (conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1))
      (norm): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (LeakyRelu): LeakyReLU(negative_slope=0.2, inplace=True)
    )
    (block3): ConvBlock(
      (conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1))
      (norm): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (LeakyRelu): LeakyReLU(negative_slope=0.2, inplace=True)
    )
  )
  (tail): Sequential(
    (0): Conv2d(128, 3, kernel_size=(3, 3), stride=(1, 1))
    (1): Tanh()
  )
)
WDiscriminator(
  (head): ConvBlock(
    (conv): Conv2d(3, 128, kernel_size=(3, 3), stride=(1, 1))
    (norm): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (LeakyRelu): LeakyReLU(negative_slope=0.2, inplace=True)
  )
  (body): Sequential(
    (block1): ConvBlock(
      (conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1))
      (norm): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (LeakyRelu): LeakyReLU(negative_slope=0.2, inplace=True)
    )
    (block2): ConvBlock(
      (conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1))
      (norm): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (LeakyRelu): LeakyReLU(negative_slope=0.2, inplace=True)
    )
    (block3): ConvBlock(
      (conv): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1))
      (norm): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (LeakyRelu): LeakyReLU(negative_slope=0.2, inplace=True)
    )
  )
  (tail): Conv2d(128, 1, kernel_size=(3, 3), stride=(1, 1))
)
Traceback (most recent call last):
  File "main_train.py", line 29, in <module>
    train(opt, Gs, Zs, reals, NoiseAmp)
  File "C:\Users\Wooks\Source\ml_khan_20185057\SinGAN\SinGAN\training.py", line 39, in train
    z_curr,in_s,G_curr = train_single_scale(D_curr,G_curr,reals,Gs,Zs,in_s,NoiseAmp,opt)
  File "C:\Users\Wooks\Source\ml_khan_20185057\SinGAN\SinGAN\training.py", line 162, in train_single_scale
    gradient_penalty.backward()
  File "C:\Users\Wooks\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\tensor.py", line 166, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "C:\Users\Wooks\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\autograd\__init__.py", line 99, in backward
    allow_unreachable=True) # allow_unreachable flag
RuntimeError: CUDA out of memory. Tried to allocate 22.00 MiB (GPU 0; 2.00 GiB total capacity; 1.14 GiB already allocated; 9.49 MiB free; 177.34 MiB cached)
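
On a 2 GiB card the finer scales simply may not fit with the default settings. Below is a minimal sketch of two common workarounds; the `--max_size` flag and the `D_curr`/`train_single_scale` names are taken from SinGAN's config.py and training.py as I understand them (treat them as assumptions and check your copy), not a verified patch:

```python
# Sketch only - verify the option names against config.py in your checkout.

# (a) Train on a smaller image so the finest scales need less activation memory:
#
#     python main_train.py --input_name your_image.png --max_size 160
#
# (b) Release the finished per-scale critic before moving on to the next scale,
#     e.g. right after train_single_scale(...) returns in training.py:
import torch

D_curr = None               # drop the reference to the per-scale discriminator
torch.cuda.empty_cache()    # hand cached but unused GPU blocks back to the driver
```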

Hello! I ran into the same problem when running main_train.py, but only after adding an attention layer to both the generator and the discriminator; the original code trained without any issues. Can a single attention layer really cause the GPU to run out of memory? Thank you, and wish you a happy life!

@markstrefford ...ran into a similar issue; I have 6 GiB of GPU memory and was training on a 1024x1024 pixel image...

Attention layers consume a lot of memory. You can try pooling or another mechanism to shrink the attention matrix and bring the memory usage down.
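
For reference, here is a minimal sketch of what "pooling to shrink the attention matrix" can look like; this is a generic PyTorch layer, not code from SinGAN, and the class name and stride are just illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PooledSelfAttention(nn.Module):
    """Self-attention whose keys/values are average-pooled by `stride`,
    so the attention matrix is (H*W) x (H*W / stride^2) instead of (H*W)^2."""
    def __init__(self, channels, stride=4):
        super().__init__()
        self.q = nn.Conv2d(channels, channels // 8, 1)
        self.k = nn.Conv2d(channels, channels // 8, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        self.stride = stride
        self.gamma = nn.Parameter(torch.zeros(1))  # start as an identity residual

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)             # B x HW x C'
        k = self.k(F.avg_pool2d(x, self.stride)).flatten(2)  # B x C' x hw
        v = self.v(F.avg_pool2d(x, self.stride)).flatten(2)  # B x C  x hw
        attn = torch.softmax(q @ k, dim=-1)                  # B x HW x hw
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)    # B x C x H x W
        return x + self.gamma * out
```

With stride 4 the attention matrix has roughly 16x fewer entries than full self-attention, which is what saves the memory.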

@victorca25 Thank you for your idea, it has helped me a lot. Wish you a happy life!

I would like to know why memory usage keeps increasing when training the model at finer scales. Since the parameters of the previous models are fixed, I don't understand where the extra memory goes.
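
As far as I understand, even with earlier generators frozen, each finer scale works on a larger image, so the activation maps and the gradient-penalty graph grow with it. One way to confirm where the growth happens is to log GPU memory at the start of each scale; a minimal sketch (the `Gs` name is just the generator list from training.py, used here as a scale index):

```python
import torch

def log_gpu_memory(tag):
    """Print current and peak allocation to see which scale pushes past the limit."""
    mib = 2 ** 20
    print(f"[{tag}] allocated {torch.cuda.memory_allocated() / mib:.1f} MiB, "
          f"peak {torch.cuda.max_memory_allocated() / mib:.1f} MiB")
    torch.cuda.reset_max_memory_allocated()  # track the peak per scale, not globally

# e.g. at the top of train_single_scale():
# log_gpu_memory(f"scale {len(Gs)}")
```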

@ahmadxon : How did you solve the "out of memory" error?

I just used the Google Colab platform and ran it there.
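
If you go the Colab route, it is worth checking which GPU you were assigned before training; a quick check (plain PyTorch, nothing SinGAN-specific):

```python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 2**30:.1f} GiB")  # e.g. a T4 reports ~15 GiB
else:
    print("No GPU assigned - switch the Colab runtime type to GPU.")
```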