ERROR: LoadError: MethodError: no method matching unsafe_free!(::Nothing)
andevellicus opened this issue · 7 comments
With the recent Knet 1.4.1 update I randomly get the following error:
ERROR: LoadError: MethodError: no method matching unsafe_free!(::Nothing)
Closest candidates are:
unsafe_free!(::CuArray) at /home/andevellicus/.julia/packages/CUDA/dZvbp/src/array.jl:35
unsafe_free!(::CUDA.CUSPARSE.CuSparseVector) at /home/andevellicus/.julia/packages/CUDA/dZvbp/lib/cusparse/array.jl:34
unsafe_free!(::CUDA.CUFFT.CuFFTPlan) at /home/andevellicus/.julia/packages/CUDA/dZvbp/lib/cufft/fft.jl:27
Stacktrace:
[1] gcnode(::AutoGrad.Node, ::AutoGrad.Tape) at /home/andevellicus/.julia/packages/Knet/8aEsn/src/autograd_gpu/gcnode.jl:88
[2] differentiate(::Function; o::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/andevellicus/.julia/packages/AutoGrad/VFrAv/src/core.jl:168
[3] differentiate at /home/andevellicus/.julia/packages/AutoGrad/VFrAv/src/core.jl:135 [inlined]
[4] minimize!(::SEUNet3D, ::KnetArray{Float32,5}, ::KnetArray{Float32,5}) at xxx/train.jl:14
[5] main() at /home/andevellicus/Programming/ML/julia/knet-sdh-seg/train.jl:85
[6] top-level scope at xxx/train.jl:109
[7] include(::Function, ::Module, ::String) at ./Base.jl:380
[8] include(::Module, ::String) at ./Base.jl:368
[9] exec_options(::Base.JLOptions) at ./client.jl:296
[10] _start() at ./client.jl:506
in expression starting at xxx/train.jl:109
It isn't consistent, but seems to pop up at random points during batch training.
I do not quite understand why this can happen, so it would be nice to have an MWE.
However, I implemented a fix on the dy/610 branch, which you can install with Pkg.add(PackageSpec(name="Knet", rev="dy/610")),
and it will probably fix it. Let me know if this works.
That fixed it from crashing. I do get this though:
┌ Warning: gcnode error: c=Nothing v=230 ni=230
└ @ Knet.AutoGrad_gpu ~/.julia/packages/Knet/OEZWM/src/autograd_gpu/gcnode.jl:90
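For reference, the defensive pattern behind that warning can be sketched like this (a hypothetical illustration with made-up names, not the actual gcnode source):

```julia
# Hypothetical sketch (eager_free! is an illustrative name, not Knet's actual
# API): before eagerly freeing a tensor tracked through a WeakRef, check
# whether the regular GC already collected the referent. This check is what
# turns the MethodError from the original report into the warning above.
function eager_free!(ref::WeakRef)
    c = ref.value
    if c === nothing
        @warn "gcnode error: c=Nothing"   # referent was collected first
        return false
    end
    # Knet would call CUDA.unsafe_free!(c) here; this sketch just reports success.
    return true
end

obj = Ref(42)                       # stand-in referent, kept strongly reachable
@assert eager_free!(WeakRef(obj))   # still alive: would be freed
@assert !eager_free!(WeakRef(nothing))  # already gone: warn, do not crash
```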
I'll see if I can get a MWE up.
Not sure if this is at all helpful, but the warning message appears only once per run: in one instance it showed up in epoch 1, and in another instance with a smaller model it showed up in epoch 3, then never again.
MWE:
using Knet
using CUDA
setoptim!(m, optimizer) = for p in params(m); p.opt = Knet.clone(optimizer); end
dice(x, y; smooth::Float32=1f0) = (2*sum(y .* x) + smooth) / (sum(y.^2) + sum(x.^2) + smooth)
loss(x, y) = 1 - dice(x, y)
function minimize!(model, x::KnetArray, y::KnetArray)
    ld = @diff loss(model(x), y)
    for w in params(model)
        Knet.update!(w, grad(ld, w))
    end
    return value(ld)
end
# Define a chain of layers:
struct Chain; layers; end
(c::Chain)(x) = (for l in c.layers; x = l(x); end; x)
struct test_model; c; end
function (m::test_model)(x)
    x = m.c(x)
    return x
end
function test_model()
    w = param(3, 3, 3, 1, 8)
    c = Chain((
        x->conv4(w, x, stride=2, padding=1),
        x->unpool(x),
        x->conv4(w, x, stride=2, padding=1),
        x->unpool(x),
        x->conv4(w, x, stride=2, padding=1),
        x->unpool(x)
    ))
    test_model(c)
end
# Main training loop
function main()
    CUDA.device!(1)
    # Get model
    model = test_model()
    setoptim!(model, Adam())
    # Kick off the training loop
    for epoch in 1:5
        @info "Epoch $epoch of 5"
        for batch in 1:5
            x = rand(Float32, 448, 256, 256, 1, 1)
            y = rand(Float32, 448, 256, 256, 1, 1)
            train_loss = minimize!(model, KnetArray(x), KnetArray(y))
        end
        println("")
    end
end
main()
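As a side note, the Dice loss from the MWE can be sanity-checked on plain CPU arrays (no GPU needed): identical inputs give dice ≈ 1 and loss ≈ 0, while the smooth term keeps the ratio finite even for all-zero inputs.

```julia
# CPU-only sanity check of the MWE's Dice loss (same definitions as above).
dice(x, y; smooth::Float32=1f0) = (2*sum(y .* x) + smooth) / (sum(y.^2) + sum(x.^2) + smooth)
loss(x, y) = 1 - dice(x, y)

x = ones(Float32, 2, 2)
@assert isapprox(loss(x, x), 0f0; atol=1f-6)                        # perfect overlap
@assert isapprox(loss(x, zeros(Float32, 2, 2)), 0.8f0; atol=1f-6)   # no overlap: 1 - 1/5
```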
Setup: Julia 1.5.1, CUDA.jl v1.3.3, Knet v1.4.1 (https://github.com/denizyuret/Knet.jl.git#dy/610).
This gives me that error on a K80 with 12 GB of RAM per GPU. You may have to adjust the size of the rand arrays depending on how much memory you have; I basically just played with the numbers. If they are too small, the error does not show up, so it seems to occur only under high memory pressure.
A question for @maleadt: I am doing eager gc with unsafe_free! during the backward pass here. I use WeakRefs to avoid hanging on to memory in the data structures I use to figure out when an array can no longer be in use. Sometimes the values of these WeakRefs come back as nothing even though they were initialized with CuArrays in the beginning. I assume this happens when the regular gc gets to these CuArrays before me, but just wanted to make sure: does the value of a WeakRef turn into nothing when the object is garbage collected?
> does the value of a WeakRef turn into nothing when the object is garbage collected

Correct. Ref JuliaLang/julia#26745
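The behavior confirmed above can be observed directly in plain Julia (a minimal sketch, independent of CUDA; Payload is just an illustrative stand-in for a CuArray):

```julia
# Minimal check of WeakRef semantics: once no strong reference to the referent
# remains and the GC runs, the WeakRef's value field reads as nothing.
mutable struct Payload
    data::Vector{Float64}
end

r = WeakRef(Payload(rand(4)))   # no strong reference retained anywhere
GC.gc(); GC.gc()                # force full collections
println("value after GC: ", r.value)   # typically nothing once collected
```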