FluxML/Flux.jl

CUDA GPU memory usage increasing over time


This issue has come up multiple times on Discord and in the following Discourse threads:

https://discourse.julialang.org/t/memory-usage-increasing-with-each-epoch/121798
https://discourse.julialang.org/t/flux-memory-usage-high-in-srcnn/115174
https://discourse.julialang.org/t/out-of-memory-using-flux-cnn-during-back-propagation-phase/24492
https://discourse.julialang.org/t/flux-gpu-memory-problems/79783

It could be related to #828, #302, #736, and JuliaGPU/CUDA.jl#137.

This is a minimal example, involving only the forward pass, on Flux's master:

using Flux
using Statistics, Random

using CUDA

function train_mlp()
    d_in = 128
    d_out = 128
    batch_size = 128
    num_iters = 10
    device = gpu_device()
    
    model = Dense(d_in => d_out) |> device
    x = randn(Float32, d_in, batch_size) |> device
    for iter in 1:num_iters
        ŷ = model(x)
        @info iter
        # GC.gc(true)
        CUDA.pool_status()
    end
end

train_mlp()
# GC.gc(true)
# CUDA.reclaim()

with output

[ Info: 1
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 1.586 MiB (32.000 MiB reserved)
[ Info: 2
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 2.091 MiB (32.000 MiB reserved)
[ Info: 3
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 2.596 MiB (32.000 MiB reserved)
[ Info: 4
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 3.101 MiB (32.000 MiB reserved)
[ Info: 5
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 3.606 MiB (32.000 MiB reserved)
[ Info: 6
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 4.110 MiB (32.000 MiB reserved)
[ Info: 7
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 4.615 MiB (32.000 MiB reserved)
[ Info: 8
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 5.120 MiB (32.000 MiB reserved)
[ Info: 9
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 5.625 MiB (32.000 MiB reserved)
[ Info: 10
Effective GPU memory usage: 2.98% (716.688 MiB/23.465 GiB)
Memory pool usage: 6.130 MiB (32.000 MiB reserved)

Running train_mlp() multiple times, the memory usage keeps increasing and more and more memory is reserved.

Mitigation strategies are to set a memory limit, e.g.

ENV["JULIA_CUDA_HARD_MEMORY_LIMIT"] = "10%"
ENV["JULIA_CUDA_SOFT_MEMORY_LIMIT"] = "5%"

or to manually run the garbage collector

GC.gc(true)

which slows training down a lot if done every iteration.
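
A compromise is to trigger a full collection only every few iterations rather than on every one. A minimal sketch along the lines of the example above (the function name and the gc_every = 50 interval are made up here and would need tuning):

using Flux
using CUDA

function train_mlp_with_periodic_gc(; num_iters = 1000, gc_every = 50)
    device = gpu_device()
    model = Dense(128 => 128) |> device
    x = randn(Float32, 128, 128) |> device
    for iter in 1:num_iters
        ŷ = model(x)
        # Amortize the GC cost: a full collection on every iteration is what
        # makes training slow, so only collect every `gc_every` iterations.
        if iter % gc_every == 0
            GC.gc(true)
            CUDA.reclaim()  # hand freed pool memory back to the driver
        end
    end
end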

This behavior is highly problematic because training runs quickly fill the GPU, leaving no room to run other GPU processes.

cc @maleadt

The issue is likely that the GC is not GPU-aware and does not finalize GPU arrays in time, so the memory just keeps growing even though only a fraction of it is actually in use.

Maybe with EscapeAnalysis and things like JuliaLang/julia#55990 the situation can be improved, but I'm not sure whether that can work effectively with things like Zygote (maybe @aviatesk can clarify).

The situation is worse if you render your desktop and run computations on the same GPU, since when you run out of memory in Julia you also crash your DE.
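
In the meantime, a manual workaround is to eagerly release intermediates with CUDA.unsafe_free! once they are no longer needed, instead of waiting for their finalizers to run. A minimal sketch (only safe when nothing else still uses the freed array):

using CUDA

x = CUDA.randn(Float32, 128, 128)
for iter in 1:10
    y = x * x               # some GPU intermediate produced this iteration
    s = sum(y)              # last use of y
    # Return y's buffer to the memory pool immediately instead of waiting for
    # the GC to run its finalizer; safe only because y is not used afterwards.
    CUDA.unsafe_free!(y)
end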


In AMDGPU.jl, I recently experimented with allowing users to define a region of code in which all GPU allocations are recorded and then bulk-freed once the program leaves that region.

Example:

θ = <parameters>
AMDGPU.record_memory!(true)
∇ = gradient(θ) do θ
   ...
end
apply!(θ, ∇) # in-place parameter update
AMDGPU.record_memory!(false) # bulk-free all allocations that happened during recording.

It significantly improved memory usage with GaussianSplatting.jl:

[Screenshots: GPU memory usage before vs. after]

Maybe there are better approaches to this, but as an idea I think it can also easily be extended to Flux (e.g. via the training API).
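
For illustration, a rough sketch of what a single Flux training step could look like with the record_memory! calls from the example above (the Dense model, mse loss, and Adam optimiser are arbitrary placeholders):

using Flux
using AMDGPU

# Sketch only: wrap one training step in a recording region so that all GPU
# allocations made while computing the loss and gradients are bulk-freed at
# the end of the step.
function train_step!(model, opt_state, x, y)
    AMDGPU.record_memory!(true)
    ∇ = Flux.gradient(model) do m
        Flux.mse(m(x), y)
    end
    Flux.update!(opt_state, model, ∇[1])  # in-place parameter update
    AMDGPU.record_memory!(false)          # bulk-free all recorded allocations
    return nothing
end

# model     = Dense(128 => 128) |> gpu_device()
# opt_state = Flux.setup(Adam(1f-3), model)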

Also while setting

ENV["JULIA_CUDA_HARD_MEMORY_LIMIT"] = "10%"
ENV["JULIA_CUDA_SOFT_MEMORY_LIMIT"] = "5%"

does cap the maximum memory usage, it does not really improve performance, since when you hit the limit we manually trigger the GC, which can easily take 600+ ms of which only ~10-20 ms is actually spent freeing GPU memory:
[Screenshot: profiling of the GC pause]

So recording memory allocations and bulk-freeing them also helps with this.

I've experimented with yet another approach (JuliaGPU/AMDGPU.jl#708) that further improves performance significantly.
Instead of recording memory allocations and then bulk-freeing them, I've implemented a caching memory allocator that keeps allocations alive on the Julia side (unless invalidated), which is in some sense similar to how PyTorch manages GPU allocations.

The pattern in the code should be similar to the previous approach:
compute loss -> compute gradients -> in-place model update

Benchmarking 1k training steps of GaussianSplatting.jl, we get stable GPU memory consumption:

Memory recording (old approach): [GPU memory utilization screenshot]; time: 138.090182 seconds
Caching allocator (new approach): [GPU memory utilization screenshot]; time: 93.672892 seconds

And here's an example of how to use it in the code.
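
A rough usage sketch of that pattern (hypothetical: the with_caching_allocator name and the :train cache id are illustrative assumptions and may not match the actual API in JuliaGPU/AMDGPU.jl#708):

using Flux
using AMDGPU

# Hypothetical sketch: `with_caching_allocator` and the `:train` cache name are
# assumed for illustration and may differ from the real API in AMDGPU.jl#708.
# The idea is that every step allocates inside the same cache scope, so
# temporary GPU buffers are reused across steps instead of being freed and
# reallocated each time.
function train!(model, opt_state, data)
    for (x, y) in data
        AMDGPU.with_caching_allocator(:train) do
            ∇ = Flux.gradient(m -> Flux.mse(m(x), y), model)
            Flux.update!(opt_state, model, ∇[1])  # in-place model update
        end
    end
end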