Knet GPU OOM with recent CUDA/Knet updates
With the recent CUDA.jl upgrade to 1.3, I'm running into OOM problems with my models, and I can't seem to revert to CUDA 1.2.1 with Knet 1.4.
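For reference, the downgrade attempt looked roughly like this (a sketch; I'm assuming it's Knet 1.4's compat bounds that block the resolver):

    # Sketch of the attempted downgrade; if Knet 1.4 requires CUDA >= 1.3,
    # the resolver will refuse this combination.
    using Pkg
    Pkg.add(PackageSpec(name="CUDA", version="1.2.1"))
    Pkg.add(PackageSpec(name="Knet", version="1.4.0"))  # resolver conflict expected here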
Attached is an MWE.
train_unet.txt
layers.txt
resunet.txt
The layers are taken from KnetLayers and modified to support Conv3D operations as well as the recent API changes in Knet 1.4.
Working:
Knet 1.3.9
CUDA 1.2.1
Julia 1.4
Not working:
Knet 1.4
CUDA 1.3
Julia 1.4
Running on Julia 1.4 because of #589, which this MWE also reproduces.
I could not replicate #589 with any version combination I tried.
For this example I tried CUDA 1.2.1 and 1.3, Knet 1.3.9 and 1.4, and Julia 1.4 and 1.5, always with the same OOM result (on a GPU with 6 GB of memory).
Oh, I have a 12GB GPU and my model sizes are pretty large, hence my attention to GPU memory performance... (in my MWE, for instance, I'm using input sizes of 176×176×176...)
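For scale, a quick back-of-the-envelope (the 32-channel width below is a made-up example, not taken from my model):

    # Rough footprint of one 176^3 Float32 volume; a hypothetical 32-channel
    # activation at full resolution is already a sizeable slice of 12 GB,
    # and a U-Net keeps many such activations alive for the backward pass.
    elems = 176^3                        # 5_451_776 voxels
    mib(n) = n * sizeof(Float32) / 2^20  # element count -> MiB
    mib(elems)       # ≈ 20.8 MiB, single channel
    mib(elems * 32)  # ≈ 665 MiB for one 32-channel feature map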
If you change lines 31-32

    x = rand(Float32, 176, 176, 176, 1, 1)
    y = rand(Float32, 176, 176, 176, 1, 1)

to

    x = rand(Float32, 32, 32, 32, 1, 1)
    y = rand(Float32, 32, 32, 32, 1, 1)
I tend to run at the limits of my Tesla K80's memory, so perhaps a smaller size that's right on the edge of OOM might reproduce the problem...
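Something along these lines could probe for that edge (a hedged sketch; model stands in for whatever resunet.txt builds, and the loss is a placeholder):

    using Knet, CUDA
    for n in (32, 64, 96, 128, 160, 176)
        try
            x = KnetArray(rand(Float32, n, n, n, 1, 1))
            y = KnetArray(rand(Float32, n, n, n, 1, 1))
            J = @diff sum(abs2, model(x) .- y)  # forward + backward at size n
            println("size $n: ok")
        catch e
            e isa CUDA.OutOfGPUMemoryError ? println("size $n: OOM") : rethrow()
        finally
            GC.gc(); CUDA.reclaim()  # return pooled blocks before the next trial
        end
    end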
Did the problem improve with the 1.4.1 fixes?
Well, I closed it since it seemed to be more related to CUDA.jl, and because it was hard to reproduce consistently, especially with all the changes. I haven't really been able to push things with 1.4.1 due to #610.
1.4.1 fixed some memory-related issues and should make it possible to train 50% larger models/batch sizes. That's why I was curious. I will look at #610.
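For anyone wanting to compare the two versions, CUDA.jl's memory report is one way to quantify the difference (the exact numbers will depend on driver and pool state):

    using CUDA
    CUDA.memory_status()  # prints used/free device memory and pool statistics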