Does `Arrow.write` have an upper limit for the number of columns?

Question

simsurace opened this issue 2 years ago · 1 comment

I could not find this documented:

using Arrow, DataFrames
df = DataFrame(("$i" => rand(1000) for i in 1:65536)...)
Arrow.write("data.arrow", df)

produces

julia> Arrow.write("data.arrow", df)
ERROR: MethodError: no method matching length(::Nothing)
Closest candidates are:
  length(::Union{Base.KeySet, Base.ValueIterator}) at abstractdict.jl:58
  length(::Union{LinearAlgebra.Adjoint{T, S}, LinearAlgebra.Transpose{T, S}} where {T, S}) at ~/.julia/juliaup/julia-1.8.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.8/LinearAlgebra/src/adjtrans.jl:172
  length(::Union{Tables.AbstractColumns, Tables.AbstractRow}) at ~/.julia/packages/Tables/AcRIE/src/Tables.jl:180
  ...
Stacktrace:
 [1] makeschema(b::Arrow.FlatBuffers.Builder, sch::Tables.Schema{nothing, nothing}, columns::Arrow.ToArrowTable)
   @ Arrow ~/.julia/packages/Arrow/P0wVk/src/write.jl:393
 [2] close(writer::Arrow.Writer{IOStream})
   @ Arrow ~/.julia/packages/Arrow/P0wVk/src/write.jl:244
 [3] open(::Arrow.var"#122#123"{DataFrame}, ::Type, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, Bool, Tuple{Symbol}, NamedTuple{(:file,), Tuple{Bool}}})
   @ Base ./io.jl:386
 [4] #write#121
   @ ~/.julia/packages/Arrow/P0wVk/src/write.jl:57 [inlined]
 [5] top-level scope
   @ REPL[94]:1

caused by: MethodError: no method matching length(::Nothing)
Closest candidates are:
  length(::Union{Base.KeySet, Base.ValueIterator}) at abstractdict.jl:58
  length(::Union{LinearAlgebra.Adjoint{T, S}, LinearAlgebra.Transpose{T, S}} where {T, S}) at ~/.julia/juliaup/julia-1.8.5+0.aarch64.apple.darwin14/share/julia/stdlib/v1.8/LinearAlgebra/src/adjtrans.jl:172
  length(::Union{Tables.AbstractColumns, Tables.AbstractRow}) at ~/.julia/packages/Tables/AcRIE/src/Tables.jl:180
  ...
Stacktrace:
 [1] makeschema(b::Arrow.FlatBuffers.Builder, sch::Tables.Schema{nothing, nothing}, columns::Arrow.ToArrowTable)
   @ Arrow ~/.julia/packages/Arrow/P0wVk/src/write.jl:393
 [2] makeschemamsg(sch::Tables.Schema{nothing, nothing}, columns::Arrow.ToArrowTable)
   @ Arrow ~/.julia/packages/Arrow/P0wVk/src/write.jl:430
 [3] macro expansion
   @ ~/.julia/packages/Arrow/P0wVk/src/write.jl:198 [inlined]
 [4] macro expansion
   @ ./task.jl:454 [inlined]
 [5] write(writer::Arrow.Writer{IOStream}, source::DataFrame)
   @ Arrow ~/.julia/packages/Arrow/P0wVk/src/write.jl:185
 [6] (::Arrow.var"#122#123"{DataFrame})(writer::Arrow.Writer{IOStream})
   @ Arrow ~/.julia/packages/Arrow/P0wVk/src/write.jl:58
 [7] open(::Arrow.var"#122#123"{DataFrame}, ::Type, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, Bool, Tuple{Symbol}, NamedTuple{(:file,), Tuple{Bool}}})
   @ Base ./io.jl:384
 [8] #write#121
   @ ~/.julia/packages/Arrow/P0wVk/src/write.jl:57 [inlined]
 [9] top-level scope
   @ REPL[94]:1

The same call succeeds with 65535 columns.

Answer 1 · 2023-04-04T22:47:41.000Z

This seems to work fine with pyarrow:

In [1]: import pyarrow.feather, numpy as np, pandas as pd

In [3]: df = pd.DataFrame({f"col_{k}": np.random.rand(100) for k in range(65538)})

In [4]: pyarrow.feather.write_feather(df, "/tmp/wide.feather", compression="uncompressed")

In [6]: pyarrow.feather.read_table("/tmp/wide.feather")["col_65537"]
Out[6]:
<pyarrow.lib.ChunkedArray object at 0x7fd528293c90>
[
  [
    0.3791875035442084,
    0.5547163201551565,
    0.13564446518017992,
    0.4183265184379561,
    0.8100731859852923,
    ...
    0.6820512183941593,
    0.6142216465909046,
    0.7692441575177542,
    0.07715418533522123,
    0.38896656434696375
  ]
]