Removing .arrow files without closing Julia seems impossible in Windows
Tortar opened this issue · 18 comments
Hi,
I'm a bit in a desperate situation since I'd like to resolve an issue in Agents.jl involving the Arrow.jl package on Windows but even if I tried hard I couldn't: it seems to me that in reality it could be a problem with Arrow.jl since all my (maybe poor) attempts to solve the problem failed. This is the PR trying to solve it: JuliaDynamics/Agents.jl#909.
In practice we have currently in Agents.jl here a test which performs:
# Arrow, fails on Windows (see issue #826 (https://github.com/JuliaDynamics/Agents.jl/issues/826))
if !(Sys.iswindows())
offline_run!(model, 365 * 5;
when_model = each_year,
when = six_months,
backend = :arrow,
mdata = [:flag, :year],
adata = [(:weight, mean)],
writing_interval = 3
)
adata_saved = DataFrame(Arrow.Table("adata.arrow"))
@test size(adata_saved) == (11, 2)
@test propertynames(adata_saved) == [:step, :mean_weight]
mdata_saved = DataFrame(Arrow.Table("mdata.arrow"))
@test size(mdata_saved) == (6, 3)
@test propertynames(mdata_saved) == [:step, :flag, :year]
@test size(vcat(DataFrame.(Arrow.Stream("adata.arrow"))...)) == (11, 2)
@test size(vcat(DataFrame.(Arrow.Stream("mdata.arrow"))...)) == (6, 3)
rm("adata.arrow")
rm("mdata.arrow")
@test !isfile("adata.arrow")
@test !isfile("mdata.arrow")
end
On Linux and MacOS it works fine, but on Windows the operating system refuses to remove the .arrow files giving:
ERROR: IOError: unlink("adata.arrow"): permission denied (EACCES)
Stacktrace:
[1] uv_error
@ .\libuv.jl:100 [inlined]
[2] unlink(p::String)
@ Base.Filesystem .\file.jl:972
[3] rm(path::String; force::Bool, recursive::Bool)
@ Base.Filesystem .\file.jl:283
[4] rm(path::String)
@ Base.Filesystem .\file.jl:273
[5] top-level scope
@ REPL[17]:1
As you can see for Windows it is disables since we didn't manage to make it work.
It's probably necessary to describe what is the relationship between the Agents.jl function offline_run!
and Arrow.jl, it is rather simple, offline_run!
calls multiple times this other function:
function Agents.writer_arrow(filename, data, append)
if append
Arrow.append(filename, data)
else
Arrow.write(filename, data; file = false)
end
end
I'm trying multiple ways to solve the issue, but none worked. I will try to provide an MWE if necessary.
Arrow uses memory map which means as long as the memory map is not garbage collected, Windows can't delete, because windows filesystem is kinda annoying.
What you should do is to make sure reference is cleared and forced GC by running GC.gc() a few times
thanks @Moelf .
I tried this code, but removal still gives error
adata_arrow = Arrow.Table("adata.arrow")
adata_saved = DataFrame(adata_arrow)
@test size(adata_saved) == (11, 2)
@test propertynames(adata_saved) == [:step, :mean_weight]
mdata_arrow = Arrow.Table("mdata.arrow")
mdata_saved = DataFrame(mdata_arrow)
@test size(mdata_saved) == (6, 3)
@test propertynames(mdata_saved) == [:step, :flag, :year]
adata_stream_arrow = Arrow.Stream("adata.arrow")
mdata_stream_arrow = Arrow.Stream("mdata.arrow")
@test size(vcat(DataFrame.(adata_stream_arrow)...)) == (11, 2)
@test size(vcat(DataFrame.(mdata_stream_arrow)...)) == (6, 3)
if Sys.iswindows()
adata_arrow = nothing
mdata_arrow = nothing
mdata_stream_arrow = nothing
adata_stream_arrow = nothing
for _ in 1:5
GC.gc()
end
end
rm("adata.arrow")
rm("mdata.arrow")
@test !isfile("adata.arrow")
@test !isfile("mdata.arrow")
edit: yes actually there was a typo, will rerun tests
adata_arrow = Arrow.Table("adata.arrow")
this is also a reference holding adata.arrow
I fixes the typo in the above code and rerun test, still getting IOError: unlink("adata.arrow"): permission denied (EACCES)
, how should I clean up the reference? doesn't adata_arrow = nothing
do it?
I'm out of ideas here.
thanks anyway for the help @Moelf .
To be able to bisect the issue more easily I created this MWE:
using Arrow, DataFrames
function writer_arrow(filename, data, append)
if append
Arrow.append(filename, data)
else
Arrow.write(filename, data; file = false)
end
end
filename = "MWE.arrow"
data = DataFrame(a=1:4, b=["M", "F", "F", "M"])
writer_arrow(filename, data, false)
for i in 1:10
writer_arrow(filename, data, true)
end
table = Arrow.Table("MWE.arrow")
data_saved = DataFrame(table)
table = nothing
GC.gc(); GC.gc()
rm("MWE.arrow")
which gives when runned in REPL:
julia> rm("MWE.arrow")
ERROR: IOError: unlink("MWE.arrow"): permission denied (EACCES)
Stacktrace:
[1] uv_error
@ .\libuv.jl:100 [inlined]
[2] unlink(p::String)
@ Base.Filesystem .\file.jl:972
[3] rm(path::String; force::Bool, recursive::Bool)
@ Base.Filesystem .\file.jl:283
[4] rm(path::String)
@ Base.Filesystem .\file.jl:273
[5] top-level scope
@ REPL[11]:1
Version info of my settings:
julia> versioninfo()
Julia Version 1.9.3
Commit bed2cd540a (2023-08-24 14:43 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: Windows (x86_64-w64-mingw32)
CPU: 12 × AMD Ryzen 5 5600H with Radeon Graphics
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
Threads: 1 on 12 virtual cores
Hopefully this will help someone who can solve the issue :-)
ok strangely enough to me if you add data_saved = nothing
the file can gets removed without problems. This sounds strange to me since data_saved
is a DataFrame
which even if created with an arrow table shouldn't be related to it, do you have any idea on how is this possible?
e.g. this works fine
using Arrow, DataFrames
function writer_arrow(filename, data, append)
if append
Arrow.append(filename, data)
else
Arrow.write(filename, data; file = false)
end
end
filename = "MWE.arrow"
data = DataFrame(a=1:4, b=["M", "F", "F", "M"])
writer_arrow(filename, data, false)
for i in 1:10
writer_arrow(filename, data, true)
end
table = Arrow.Table("MWE.arrow")
data_saved = DataFrame(table)
table = nothing
data_saved = nothing
GC.gc(); GC.gc()
rm("MWE.arrow")
ohh ok got it, the dataframe doesn't copy the content of the arrow table, just steal the reference, if one tries:
julia> deleteat!(data_saved, 2)
ERROR: ArgumentError: `Arrow.Primitive{Int64, Vector{Int64}}` does not support `deleteat!`; arrow data is by nature immutable
Stacktrace:
[1] deleteat!(x::Arrow.Primitive{Int64, Vector{Int64}}, inds::Vector{Int64})
@ Arrow C:\Users\bob\.julia\packages\Arrow\R2Rvz\src\arraytypes\arraytypes.jl:32
[2] deleteat!(A::SentinelArrays.ChainedVector{Int64, Arrow.Primitive{Int64, Vector{Int64}}}, inds::Vector{Int64})
@ SentinelArrays C:\Users\bob\.julia\packages\SentinelArrays\cav7N\src\chainedvector.jl:560
[3] _deleteat!_helper(df::DataFrame, drop::Vector{Int64})
@ DataFrames C:\Users\bob\.julia\packages\DataFrames\58MUJ\src\dataframe\dataframe.jl:919
[4] deleteat!(df::DataFrame, inds::Int64)
@ DataFrames C:\Users\bob\.julia\packages\DataFrames\58MUJ\src\dataframe\dataframe.jl:861
[5] top-level scope
@ REPL[33]:1
is this correct behaviour though?
It isn't even possible to add new rows to this dataframe since the underlying arrow table is immutable:
julia> append!(data_saved, data)
ERROR: MethodError: no method matching resize!(::Arrow.Primitive{Int64, Vector{Int64}}, ::Int64)
Closest candidates are:
resize!(::Vector, ::Integer)
@ Base array.jl:1246
resize!(::BitVector, ::Integer)
@ Base bitarray.jl:814
resize!(::SparseArrays.ReadOnly, ::Any)
@ SparseArrays C:\Users\bob\AppData\Local\Programs\Julia-1.9.3\share\julia\stdlib\v1.9\SparseArrays\src\readonly.jl:33
...
....
All in all, I think this is probably a problem with DataFrames.jl.
edit: okay, actually it is specified in the DataFrames.jl docs so problem solved:
Another example is
Arrow.Table
, where arrow data is inherently immutable so columns can't be accidentally mutated anyway. To be able to mutate arrow data, columns must be materialized,
which can be accomplished viaDataFrame(arrow_table, copycols=true)
.
Reopening since even with this new information I couldn't get the code to work on Windows in the tests of Agents.jl and so...I got a new updated MWE:
using Arrow, DataFrames, Test
@testset "reproduce" begin
data = DataFrame(a=1:4, b=["M", "F", "F", "M"])
Arrow.write("MWE.arrow", data; file = false)
DataFrame.(Arrow.Stream("MWE.arrow"), copycols=true)
nothing
for _ in 1:10 GC.gc() end
rm("MWE.arrow")
end
this again throws "ERROR: IOError: unlink("MWE.arrow"): permission denied (EACCES)", notice that strangely if I remove the if statement wrapping all the code everything works fine.
edit: using a testset to make it more realistic
GC.gc()
try to add a few more GC.gc()
call
updated the code with 10 GC runs, same error
maybe try isolate the workload into a function, call that function, and run GC outside? idk if you're testing this in REPL or not, but in REPL you would have ans
referencing to the last line
I'm testing in REPL but ans
is nothing in this case though, anyway changing it in
using Arrow, DataFrames, Test
function create_file()
data = DataFrame(a=1:4, b=["M", "F", "F", "M"])
Arrow.write("MWE.arrow", data; file = false)
DataFrame.(Arrow.Stream("MWE.arrow"), copycols=true)
end
@testset "reproduce" begin
create_file()
GC.gc()
rm("MWE.arrow")
end
removes the file fine.
But still I have problems understanding why that other version without the function call instead gives error, could this be a GC issue?
well, I think global scope and running GC.gc() in a for loop
is not very clean conceptually.
I'm feeling that this problem is similar to https://discourse.julialang.org/t/how-to-force-an-object-to-be-freed-by-the-garbage-collector/77985. (also, even if I change for _ in 1:10 GC.gc() end
with GC.gc(); GC.gc(); GC.gc(); ...
the error is still there)
I think the general problem is GC interaction happens with compiled code, and the code we human write are not what actually happens line by line at runtime, so you shouldn't rely on compiler does something specific.
I agree compiler should account for GC effects but that's probably hard, meanwhile just remember to isolate by unit of individual function gets you very far already