denizyuret/Knet.jl

ERROR: LoadError: MethodError: no method matching unsafe_free!(::Nothing)

andevellicus opened this issue · 7 comments

With the recent Knet 1.4.1 update I randomly get the following error:

ERROR: LoadError: MethodError: no method matching unsafe_free!(::Nothing)
Closest candidates are:
  unsafe_free!(::CuArray) at /home/andevellicus/.julia/packages/CUDA/dZvbp/src/array.jl:35
  unsafe_free!(::CUDA.CUSPARSE.CuSparseVector) at /home/andevellicus/.julia/packages/CUDA/dZvbp/lib/cusparse/array.jl:34
  unsafe_free!(::CUDA.CUFFT.CuFFTPlan) at /home/andevellicus/.julia/packages/CUDA/dZvbp/lib/cufft/fft.jl:27
Stacktrace:
 [1] gcnode(::AutoGrad.Node, ::AutoGrad.Tape) at /home/andevellicus/.julia/packages/Knet/8aEsn/src/autograd_gpu/gcnode.jl:88
 [2] differentiate(::Function; o::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/andevellicus/.julia/packages/AutoGrad/VFrAv/src/core.jl:168
 [3] differentiate at /home/andevellicus/.julia/packages/AutoGrad/VFrAv/src/core.jl:135 [inlined]
 [4] minimize!(::SEUNet3D, ::KnetArray{Float32,5}, ::KnetArray{Float32,5}) at xxx/train.jl:14
 [5] main() at /home/andevellicus/Programming/ML/julia/knet-sdh-seg/train.jl:85
 [6] top-level scope at xxx/train.jl:109
 [7] include(::Function, ::Module, ::String) at ./Base.jl:380
 [8] include(::Module, ::String) at ./Base.jl:368
 [9] exec_options(::Base.JLOptions) at ./client.jl:296
 [10] _start() at ./client.jl:506
in expression starting at xxx/train.jl:109

It isn't consistent, but seems to pop up at random points during batch training.

I do not quite understand why this can happen, so it would be nice to have an MWE.

However, I pushed a probable fix to the dy/610 branch; you can try it with ] add Knet#dy/610 in the Pkg REPL. Let me know if this works.

That fixed the crash. I do still get this warning, though:

┌ Warning: gcnode error: c=Nothing v=230 ni=230
└ @ Knet.AutoGrad_gpu ~/.julia/packages/Knet/OEZWM/src/autograd_gpu/gcnode.jl:90

I'll see if I can get a MWE up.

Not sure if this is at all helpful, but that warning message appears only once per run: in one instance it showed up in Epoch 1, and in another instance with a smaller model it showed up in Epoch 3, never again.

MWE:

using Knet
using CUDA

setoptim!(m, optimizer) = for p in params(m); p.opt = Knet.clone(optimizer); end

# Soft Dice similarity; `smooth` avoids division by zero.
dice(x, y; smooth::Float32=1f0) = (2*sum(y .* x) + smooth) / (sum(y.^2) + sum(x.^2) + smooth)
loss(x, y) = 1 - dice(x, y)

function minimize!(model, x::KnetArray, y::KnetArray)
    ld = @diff loss(model(x), y)        # run the forward pass and record the tape
    for w in params(model)
        Knet.update!(w, grad(ld, w))    # apply each param's own optimizer (set by setoptim!)
    end
    return value(ld)
end

# Define a chain of layers:
struct Chain; layers; end
(c::Chain)(x) = (for l in c.layers; x = l(x); end; x)

struct test_model; c; end
function (m::test_model)(x)
    x = m.c(x)
    return x
end

function test_model()
    w1 = param(3, 3, 3, 1, 8)   # 1 input channel  -> 8 output channels
    w2 = param(3, 3, 3, 8, 8)   # 8 input channels -> 8 output channels
    c = Chain((
        x -> conv4(w1, x, stride=2, padding=1),
        x -> unpool(x),
        x -> conv4(w2, x, stride=2, padding=1),
        x -> unpool(x),
        x -> conv4(w2, x, stride=2, padding=1),
        x -> unpool(x)
    ))
    test_model(c)
end

# Main training loop
function main()

    CUDA.device!(1)   # select GPU 1 (the second device); adjust for your machine

    # Get model
    model = test_model()  
    setoptim!(model, Adam())

    # Kick off the training loop
    for epoch in 1:5
        @info "Epoch $epoch of 5"

        for batch in 1:5
            x = rand(Float32, 448, 256, 256, 1, 1)
            y = rand(Float32, 448, 256, 256, 1, 1)
            train_loss = minimize!(model, KnetArray(x), KnetArray(y))
        end

        println("")
    end
end

main()

Setup: Julia 1.5.1, CUDA.jl v1.3.3, and Knet v1.4.1 from https://github.com/denizyuret/Knet.jl.git#dy/610.

This gives me that error on a K80 with 12 GB of memory per GPU -- you may have to adjust the size of the rand arrays depending on how much memory you have; I basically just played with the numbers. If they're too small, the error doesn't show up, so it seems to occur only under high memory pressure.
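
For rough sizing (my arithmetic, not from the thread): each Float32 input array here is about 112 MiB, and the 8-channel intermediate activations are several times larger, so a handful of live tensors is enough to stress 12 GB.

448 * 256 * 256 * 4 / 2^20        # one (448,256,256,1,1) Float32 input: 112.0 MiB
448 * 256 * 256 * 8 * 4 / 2^30    # one full-resolution 8-channel activation: 0.875 GiB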

A question for @maleadt: I am doing eager gc with unsafe_free! during the backward pass here. The data structures I use to figure out when an array can no longer be in use hold WeakRefs, so that the bookkeeping itself does not hang on to memory. Sometimes the values of these WeakRefs come back as nothing even though they were initialized with a CuArray. I assume this happens when the regular gc gets to these CuArrays before I do, but I just wanted to make sure: does the value of a WeakRef turn into nothing when the object is garbage collected?
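
A minimal sketch of the pattern described above (tracked, track!, and free_if_alive! are hypothetical names for illustration, not Knet's actual gcnode code; only CUDA.unsafe_free! and WeakRef come from the thread):

using CUDA

# Hypothetical bookkeeping: map a node id to a WeakRef of the CuArray it
# produced, so that the bookkeeping itself never keeps the array alive.
const tracked = Dict{Int,WeakRef}()

track!(id::Int, a::CuArray) = (tracked[id] = WeakRef(a))

# Eagerly free an array once the backward pass can no longer use it.
function free_if_alive!(id::Int)
    r = get(tracked, id, nothing)
    r === nothing && return
    a = r.value
    # If the regular GC collected the array first, the WeakRef's value is
    # already nothing; guard before calling unsafe_free!.
    a === nothing || CUDA.unsafe_free!(a)
    delete!(tracked, id)
end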

does the value of a WeakRef turn into nothing when the object is garbage collected?

Correct. Ref JuliaLang/julia#26745
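
For reference, a minimal demonstration of that semantics (a single GC.gc() call does not strictly guarantee collection, so the final assert is illustrative rather than contractual):

x = Ref(0)                    # any heap-allocated object
w = WeakRef(x)                # weak reference: does not keep x alive
@assert w.value === x
x = nothing                   # drop the last strong reference
GC.gc()                       # force a full collection
@assert w.value === nothing   # the WeakRef now reports nothing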