denizyuret/Knet.jl

ERROR: LoadError: MethodError: no method matching unsafe_free!(::Nothing)

andevellicus opened this issue · 7 comments

With the recent Knet 1.4.1 update I randomly get the following error:

ERROR: LoadError: MethodError: no method matching unsafe_free!(::Nothing)
Closest candidates are:
  unsafe_free!(::CuArray) at /home/andevellicus/.julia/packages/CUDA/dZvbp/src/array.jl:35
  unsafe_free!(::CUDA.CUSPARSE.CuSparseVector) at /home/andevellicus/.julia/packages/CUDA/dZvbp/lib/cusparse/array.jl:34
  unsafe_free!(::CUDA.CUFFT.CuFFTPlan) at /home/andevellicus/.julia/packages/CUDA/dZvbp/lib/cufft/fft.jl:27
Stacktrace:
 [1] gcnode(::AutoGrad.Node, ::AutoGrad.Tape) at /home/andevellicus/.julia/packages/Knet/8aEsn/src/autograd_gpu/gcnode.jl:88
 [2] differentiate(::Function; o::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/andevellicus/.julia/packages/AutoGrad/VFrAv/src/core.jl:168
 [3] differentiate at /home/andevellicus/.julia/packages/AutoGrad/VFrAv/src/core.jl:135 [inlined]
 [4] minimize!(::SEUNet3D, ::KnetArray{Float32,5}, ::KnetArray{Float32,5}) at xxx/train.jl:14
 [5] main() at /home/andevellicus/Programming/ML/julia/knet-sdh-seg/train.jl:85
 [6] top-level scope at xxx/train.jl:109
 [7] include(::Function, ::Module, ::String) at ./Base.jl:380
 [8] include(::Module, ::String) at ./Base.jl:368
 [9] exec_options(::Base.JLOptions) at ./client.jl:296
 [10] _start() at ./client.jl:506
in expression starting at xxx/train.jl:109

It isn't consistent, but seems to pop up at random points during batch training.

I do not quite understand why this can happen, so it would be nice to have an MWE.

However, I pushed a probable fix to the dy/610 branch; you can try it with ] add Knet#dy/610 in the Pkg REPL. Let me know if this works.

That fixed the crash. I do still get this warning, though:

┌ Warning: gcnode error: c=Nothing v=230 ni=230
└ @ Knet.AutoGrad_gpu ~/.julia/packages/Knet/OEZWM/src/autograd_gpu/gcnode.jl:90

I'll see if I can get a MWE up.

Not sure if this is at all helpful, but that warning message appears only once per run: in one instance it showed up in Epoch 1, and in another instance with a smaller model it showed up in Epoch 3, never again.

MWE:

using Knet
using CUDA

setoptim!(m, optimizer) = for p in params(m); p.opt = Knet.clone(optimizer); end

# Soft Dice similarity; `smooth` avoids division by zero.
dice(x, y; smooth::Float32=1f0) = (2*sum(y .* x) + smooth) / (sum(y.^2) + sum(x.^2) + smooth)
loss(x, y) = 1 - dice(x, y)

function minimize!(model, x::KnetArray, y::KnetArray)
    ld = @diff loss(model(x), y)        # run the forward pass and record the tape
    for w in params(model)
        Knet.update!(w, grad(ld, w))    # apply each param's own optimizer (set by setoptim!)
    end
    return value(ld)
end

# Define a chain of layers:
struct Chain; layers; end
(c::Chain)(x) = (for l in c.layers; x = l(x); end; x)

struct test_model; c; end
function (m::test_model)(x)
    x = m.c(x)
    return x
end

function test_model()
    w1 = param(3, 3, 3, 1, 8)   # 1 input channel  -> 8 output channels
    w2 = param(3, 3, 3, 8, 8)   # 8 input channels -> 8 output channels
    c = Chain((
        x -> conv4(w1, x, stride=2, padding=1),
        x -> unpool(x),
        x -> conv4(w2, x, stride=2, padding=1),
        x -> unpool(x),
        x -> conv4(w2, x, stride=2, padding=1),
        x -> unpool(x)
    ))
    test_model(c)
end

# Main training loop
function main()

    CUDA.device!(1)   # select GPU 1 (the second device); adjust for your machine

    # Get model
    model = test_model()  
    setoptim!(model, Adam())

    # Kick off the training loop
    for epoch in 1:5
        @info "Epoch $epoch of 5"

        for batch in 1:5
            x = rand(Float32, 448, 256, 256, 1, 1)
            y = rand(Float32, 448, 256, 256, 1, 1)
            train_loss = minimize!(model, KnetArray(x), KnetArray(y))
        end

        println("")
    end
end

main()

Setup: Julia 1.5.1, CUDA.jl v1.3.3, and Knet v1.4.1 from https://github.com/denizyuret/Knet.jl.git#dy/610.

This gives me that error on a K80 with 12 GB of memory per GPU -- you may have to adjust the size of the rand arrays depending on how much memory you have; I basically just played with the numbers. If they're too small, the error doesn't show up, so it seems to occur only under high memory pressure.
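
For rough sizing (my arithmetic, not from the thread): each Float32 input array here is about 112 MiB, and the 8-channel intermediate activations are several times larger, so a handful of live tensors is enough to stress 12 GB.

448 * 256 * 256 * 4 / 2^20        # one (448,256,256,1,1) Float32 input: 112.0 MiB
448 * 256 * 256 * 8 * 4 / 2^30    # one full-resolution 8-channel activation: 0.875 GiB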

A question for @maleadt: I am doing eager gc with unsafe_free! during the backward pass here. The data structures I use to figure out when an array can no longer be in use hold WeakRefs, so that the bookkeeping itself does not hang on to memory. Sometimes the values of these WeakRefs come back as nothing even though they were initialized with a CuArray. I assume this happens when the regular gc gets to these CuArrays before I do, but I just wanted to make sure: does the value of a WeakRef turn into nothing when the object is garbage collected?
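
A minimal sketch of the pattern described above (tracked, track!, and free_if_alive! are hypothetical names for illustration, not Knet's actual gcnode code; only CUDA.unsafe_free! and WeakRef come from the thread):

using CUDA

# Hypothetical bookkeeping: map a node id to a WeakRef of the CuArray it
# produced, so that the bookkeeping itself never keeps the array alive.
const tracked = Dict{Int,WeakRef}()

track!(id::Int, a::CuArray) = (tracked[id] = WeakRef(a))

# Eagerly free an array once the backward pass can no longer use it.
function free_if_alive!(id::Int)
    r = get(tracked, id, nothing)
    r === nothing && return
    a = r.value
    # If the regular GC collected the array first, the WeakRef's value is
    # already nothing; guard before calling unsafe_free!.
    a === nothing || CUDA.unsafe_free!(a)
    delete!(tracked, id)
end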

does the value of a WeakRef turn into nothing when the object is garbage collected?

Correct. Ref JuliaLang/julia#26745
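
For reference, a minimal demonstration of that semantics (a single GC.gc() call does not strictly guarantee collection, so the final assert is illustrative rather than contractual):

x = Ref(0)                    # any heap-allocated object
w = WeakRef(x)                # weak reference: does not keep x alive
@assert w.value === x
x = nothing                   # drop the last strong reference
GC.gc()                       # force a full collection
@assert w.value === nothing   # the WeakRef now reports nothing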