apache/arrow-julia

Removing .arrow files without closing Julia seems impossible in Windows

Tortar opened this issue · 18 comments

Tortar commented

Hi,

I'm a bit in a desperate situation since I'd like to resolve an issue in Agents.jl involving the Arrow.jl package on Windows but even if I tried hard I couldn't: it seems to me that in reality it could be a problem with Arrow.jl since all my (maybe poor) attempts to solve the problem failed. This is the PR trying to solve it: JuliaDynamics/Agents.jl#909.

In practice we have currently in Agents.jl here a test which performs:

            # Arrow, fails on Windows (see issue #826 (https://github.com/JuliaDynamics/Agents.jl/issues/826))
            if !(Sys.iswindows())
                offline_run!(model, 365 * 5;
                    when_model = each_year,
                    when = six_months,
                    backend = :arrow,
                    mdata = [:flag, :year],
                    adata = [(:weight, mean)],
                    writing_interval = 3
                )

                adata_saved = DataFrame(Arrow.Table("adata.arrow"))
                @test size(adata_saved) == (11, 2)
                @test propertynames(adata_saved) == [:step, :mean_weight]

                mdata_saved = DataFrame(Arrow.Table("mdata.arrow"))
                @test size(mdata_saved) == (6, 3)
                @test propertynames(mdata_saved) == [:step, :flag, :year]

                @test size(vcat(DataFrame.(Arrow.Stream("adata.arrow"))...)) == (11, 2)
                @test size(vcat(DataFrame.(Arrow.Stream("mdata.arrow"))...)) == (6, 3)

                rm("adata.arrow")
                rm("mdata.arrow")
                @test !isfile("adata.arrow")
                @test !isfile("mdata.arrow")
            end

On Linux and MacOS it works fine, but on Windows the operating system refuses to remove the .arrow files giving:

ERROR: IOError: unlink("adata.arrow"): permission denied (EACCES)
Stacktrace:
 [1] uv_error
   @ .\libuv.jl:100 [inlined]
 [2] unlink(p::String)
   @ Base.Filesystem .\file.jl:972
 [3] rm(path::String; force::Bool, recursive::Bool)
   @ Base.Filesystem .\file.jl:283
 [4] rm(path::String)
   @ Base.Filesystem .\file.jl:273
 [5] top-level scope
   @ REPL[17]:1

As you can see for Windows it is disables since we didn't manage to make it work.

It's probably necessary to describe what is the relationship between the Agents.jl function offline_run! and Arrow.jl, it is rather simple, offline_run! calls multiple times this other function:

function Agents.writer_arrow(filename, data, append)
    if append
        Arrow.append(filename, data)
    else
        Arrow.write(filename, data; file = false)
    end
end

I'm trying multiple ways to solve the issue, but none worked. I will try to provide an MWE if necessary.

Moelf commented

Arrow uses memory map which means as long as the memory map is not garbage collected, Windows can't delete, because windows filesystem is kinda annoying.

What you should do is to make sure reference is cleared and forced GC by running GC.gc() a few times

Tortar commented

thanks @Moelf .

I tried this code, but removal still gives error

            adata_arrow = Arrow.Table("adata.arrow")
            adata_saved = DataFrame(adata_arrow)
            @test size(adata_saved) == (11, 2)
            @test propertynames(adata_saved) == [:step, :mean_weight]

            mdata_arrow = Arrow.Table("mdata.arrow")
            mdata_saved = DataFrame(mdata_arrow)
            @test size(mdata_saved) == (6, 3)
            @test propertynames(mdata_saved) == [:step, :flag, :year]

            adata_stream_arrow = Arrow.Stream("adata.arrow")
            mdata_stream_arrow = Arrow.Stream("mdata.arrow")
            @test size(vcat(DataFrame.(adata_stream_arrow)...)) == (11, 2)
            @test size(vcat(DataFrame.(mdata_stream_arrow)...)) == (6, 3)

            if Sys.iswindows()
                adata_arrow = nothing
                mdata_arrow = nothing
                mdata_stream_arrow = nothing
                adata_stream_arrow = nothing
                for _ in 1:5
                    GC.gc()
                end
            end
            rm("adata.arrow")
            rm("mdata.arrow")
            @test !isfile("adata.arrow")
            @test !isfile("mdata.arrow")

edit: yes actually there was a typo, will rerun tests

Moelf commented

adata_arrow = Arrow.Table("adata.arrow")

this is also a reference holding adata.arrow

Tortar commented

I fixes the typo in the above code and rerun test, still getting IOError: unlink("adata.arrow"): permission denied (EACCES), how should I clean up the reference? doesn't adata_arrow = nothing do it?

Moelf commented

I'm out of ideas here.

Tortar commented

thanks anyway for the help @Moelf .

To be able to bisect the issue more easily I created this MWE:

using Arrow, DataFrames

function writer_arrow(filename, data, append)
    if append
        Arrow.append(filename, data)
    else
        Arrow.write(filename, data; file = false)
    end
end

filename = "MWE.arrow"
data = DataFrame(a=1:4, b=["M", "F", "F", "M"])
writer_arrow(filename, data, false)
for i in 1:10
    writer_arrow(filename, data, true)
end
table = Arrow.Table("MWE.arrow")
data_saved = DataFrame(table)
table = nothing
GC.gc(); GC.gc()
rm("MWE.arrow")

which gives when runned in REPL:

julia> rm("MWE.arrow")
ERROR: IOError: unlink("MWE.arrow"): permission denied (EACCES)
Stacktrace:
 [1] uv_error
   @ .\libuv.jl:100 [inlined]
 [2] unlink(p::String)
   @ Base.Filesystem .\file.jl:972
 [3] rm(path::String; force::Bool, recursive::Bool)
   @ Base.Filesystem .\file.jl:283
 [4] rm(path::String)
   @ Base.Filesystem .\file.jl:273
 [5] top-level scope
   @ REPL[11]:1

Version info of my settings:

julia> versioninfo()
Julia Version 1.9.3
Commit bed2cd540a (2023-08-24 14:43 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 12 × AMD Ryzen 5 5600H with Radeon Graphics
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver3)
  Threads: 1 on 12 virtual cores

Hopefully this will help someone who can solve the issue :-)

Tortar commented

ok strangely enough to me if you add data_saved = nothing the file can gets removed without problems. This sounds strange to me since data_saved is a DataFrame which even if created with an arrow table shouldn't be related to it, do you have any idea on how is this possible?

e.g. this works fine

using Arrow, DataFrames

function writer_arrow(filename, data, append)
    if append
        Arrow.append(filename, data)
    else
        Arrow.write(filename, data; file = false)
    end
end

filename = "MWE.arrow"
data = DataFrame(a=1:4, b=["M", "F", "F", "M"])
writer_arrow(filename, data, false)
for i in 1:10
    writer_arrow(filename, data, true)
end
table = Arrow.Table("MWE.arrow")
data_saved = DataFrame(table)
table = nothing
data_saved = nothing
GC.gc(); GC.gc()
rm("MWE.arrow")
Tortar commented

ohh ok got it, the dataframe doesn't copy the content of the arrow table, just steal the reference, if one tries:

julia> deleteat!(data_saved, 2)
ERROR: ArgumentError: `Arrow.Primitive{Int64, Vector{Int64}}` does not support `deleteat!`; arrow data is by nature immutable
Stacktrace:
 [1] deleteat!(x::Arrow.Primitive{Int64, Vector{Int64}}, inds::Vector{Int64})
   @ Arrow C:\Users\bob\.julia\packages\Arrow\R2Rvz\src\arraytypes\arraytypes.jl:32
 [2] deleteat!(A::SentinelArrays.ChainedVector{Int64, Arrow.Primitive{Int64, Vector{Int64}}}, inds::Vector{Int64})
   @ SentinelArrays C:\Users\bob\.julia\packages\SentinelArrays\cav7N\src\chainedvector.jl:560
 [3] _deleteat!_helper(df::DataFrame, drop::Vector{Int64})
   @ DataFrames C:\Users\bob\.julia\packages\DataFrames\58MUJ\src\dataframe\dataframe.jl:919
 [4] deleteat!(df::DataFrame, inds::Int64)
   @ DataFrames C:\Users\bob\.julia\packages\DataFrames\58MUJ\src\dataframe\dataframe.jl:861
 [5] top-level scope
   @ REPL[33]:1

is this correct behaviour though?

It isn't even possible to add new rows to this dataframe since the underlying arrow table is immutable:

julia> append!(data_saved, data)
ERROR: MethodError: no method matching resize!(::Arrow.Primitive{Int64, Vector{Int64}}, ::Int64)

Closest candidates are:
  resize!(::Vector, ::Integer)
   @ Base array.jl:1246
  resize!(::BitVector, ::Integer)
   @ Base bitarray.jl:814
  resize!(::SparseArrays.ReadOnly, ::Any)
   @ SparseArrays C:\Users\bob\AppData\Local\Programs\Julia-1.9.3\share\julia\stdlib\v1.9\SparseArrays\src\readonly.jl:33
  ...

....
Tortar commented

All in all, I think this is probably a problem with DataFrames.jl.

edit: okay, actually it is specified in the DataFrames.jl docs so problem solved:

Another example is Arrow.Table, where arrow data is inherently immutable so columns can't be accidentally mutated anyway. To be able to mutate arrow data, columns must be materialized,
which can be accomplished via DataFrame(arrow_table, copycols=true).

Tortar commented

Reopening since even with this new information I couldn't get the code to work on Windows in the tests of Agents.jl and so...I got a new updated MWE:

using Arrow, DataFrames, Test
@testset "reproduce" begin
    data = DataFrame(a=1:4, b=["M", "F", "F", "M"])
    Arrow.write("MWE.arrow", data; file = false)
    DataFrame.(Arrow.Stream("MWE.arrow"), copycols=true)
    nothing
    for _ in 1:10 GC.gc() end
    rm("MWE.arrow")
end

this again throws "ERROR: IOError: unlink("MWE.arrow"): permission denied (EACCES)", notice that strangely if I remove the if statement wrapping all the code everything works fine.

edit: using a testset to make it more realistic

Moelf commented

GC.gc()

try to add a few more GC.gc() call

Tortar commented

updated the code with 10 GC runs, same error

Moelf commented

maybe try isolate the workload into a function, call that function, and run GC outside? idk if you're testing this in REPL or not, but in REPL you would have ans referencing to the last line

Tortar commented

I'm testing in REPL but ans is nothing in this case though, anyway changing it in

using Arrow, DataFrames, Test
function create_file()
    data = DataFrame(a=1:4, b=["M", "F", "F", "M"])
    Arrow.write("MWE.arrow", data; file = false)
    DataFrame.(Arrow.Stream("MWE.arrow"), copycols=true)
end
@testset "reproduce" begin
    create_file()
    GC.gc()
    rm("MWE.arrow")
end

removes the file fine.

But still I have problems understanding why that other version without the function call instead gives error, could this be a GC issue?

Moelf commented

well, I think global scope and running GC.gc() in a for loop is not very clean conceptually.

Tortar commented

I'm feeling that this problem is similar to https://discourse.julialang.org/t/how-to-force-an-object-to-be-freed-by-the-garbage-collector/77985. (also, even if I change for _ in 1:10 GC.gc() end with GC.gc(); GC.gc(); GC.gc(); ... the error is still there)

Moelf commented

I think the general problem is GC interaction happens with compiled code, and the code we human write are not what actually happens line by line at runtime, so you shouldn't rely on compiler does something specific.

I agree compiler should account for GC effects but that's probably hard, meanwhile just remember to isolate by unit of individual function gets you very far already

Tortar commented

thanks for all the help @Moelf, do you think we can close this one then since the problem seems unrelated to Arrow in particular?