Knet GPU OOM with recent CUDA/Knet updates
With the recent CUDA.jl upgrade to 1.3, I'm running into OOM problems with my models, and I can't seem to revert to CUDA 1.2.1 with Knet 1.4.
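For reference, the downgrade attempt looked roughly like this (a sketch; I'm assuming it's Knet 1.4's compat bounds that block the resolver):

    # Sketch of the attempted downgrade; if Knet 1.4 requires CUDA >= 1.3,
    # the resolver will refuse this combination.
    using Pkg
    Pkg.add(PackageSpec(name="CUDA", version="1.2.1"))
    Pkg.add(PackageSpec(name="Knet", version="1.4.0"))  # resolver conflict expected here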
Attached is an MWE.
train_unet.txt
layers.txt
resunet.txt
The layers are taken from KnetLayers and modified to support Conv3D operations as well as the recent API changes in Knet 1.4.
Working:
Knet 1.3.9
CUDA 1.2.1
Julia 1.4
Not working:
Knet 1.4
CUDA 1.3
Julia 1.4
Running on Julia 1.4 because of #589, which this MWE also reproduces.
I could not replicate #589 with any version combination I tried.
For this example I tried CUDA 1.2.1 and 1.3, Knet 1.3.9 and 1.4, and Julia 1.4 and 1.5, always with the same OOM result (on a GPU with 6 GB of memory).
Oh, I have a 12GB GPU and my model sizes are pretty large, hence my attention to GPU memory performance... (in my MWE, for instance, I'm using input sizes of 176×176×176...)
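For scale, a quick back-of-the-envelope (the 32-channel width below is a made-up example, not taken from my model):

    # Rough footprint of one 176^3 Float32 volume; a hypothetical 32-channel
    # activation at full resolution is already a sizeable slice of 12 GB,
    # and a U-Net keeps many such activations alive for the backward pass.
    elems = 176^3                        # 5_451_776 voxels
    mib(n) = n * sizeof(Float32) / 2^20  # element count -> MiB
    mib(elems)       # ≈ 20.8 MiB, single channel
    mib(elems * 32)  # ≈ 665 MiB for one 32-channel feature map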
If you change lines 31-32

    x = rand(Float32, 176, 176, 176, 1, 1)
    y = rand(Float32, 176, 176, 176, 1, 1)

to

    x = rand(Float32, 32, 32, 32, 1, 1)
    y = rand(Float32, 32, 32, 32, 1, 1)
I tend to run at the limits of my Tesla K80's memory, so perhaps a smaller size that's right on the edge of OOM might reproduce the problem...
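Something along these lines could probe for that edge (a hedged sketch; model stands in for whatever resunet.txt builds, and the loss is a placeholder):

    using Knet, CUDA
    for n in (32, 64, 96, 128, 160, 176)
        try
            x = KnetArray(rand(Float32, n, n, n, 1, 1))
            y = KnetArray(rand(Float32, n, n, n, 1, 1))
            J = @diff sum(abs2, model(x) .- y)  # forward + backward at size n
            println("size $n: ok")
        catch e
            e isa CUDA.OutOfGPUMemoryError ? println("size $n: OOM") : rethrow()
        finally
            GC.gc(); CUDA.reclaim()  # return pooled blocks before the next trial
        end
    end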
Did the problem improve with the 1.4.1 fixes?
Well, I closed it since it seemed to be more related to CUDA.jl, and because it was hard to reproduce consistently, especially with all the changes. I haven't really been able to push things with 1.4.1 due to #610.
1.4.1 fixed some memory-related issues and should make it possible to train 50% larger models/batch sizes. That's why I was curious. I will look at #610.
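For anyone wanting to compare the two versions, CUDA.jl's memory report is one way to quantify the difference (the exact numbers will depend on driver and pool state):

    using CUDA
    CUDA.memory_status()  # prints used/free device memory and pool statistics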