apache/arrow-julia

`NTuple{UInt8}` not getting correctly written out

Moelf opened this issue · 3 comments

Moelf commented

Similar to #411, but this one concerns fixedsizelist.jl:

julia> data1 = (; x = [(0x01, 0x02), (0x03, 0x04)])

julia> Arrow.write("/tmp/julia1.feather", data1)

julia> data2 = (; x = [b"\x01\x02", b"\x03\x04"])

julia> Arrow.write("/tmp/julia2.feather", data2)

julia> data3 = (; x = [(0x0001, 0x0002), (0x0003, 0x0004)])

julia> Arrow.write("/tmp/julia3.feather", data3)

In [12]: pyarrow.feather.read_table("/tmp/julia1.feather")["x"]
Out[12]:
<pyarrow.lib.ChunkedArray object at 0x7fd62050c400>
[
  [
    0102,
    0304
  ]
]

In [13]: pyarrow.feather.read_table("/tmp/julia2.feather")["x"]
Out[13]:
<pyarrow.lib.ChunkedArray object at 0x7fd62387ee30>
[
  [
    0102,
    0304
  ]
]

In [14]: pyarrow.feather.read_table("/tmp/julia3.feather")["x"]
Out[14]:
<pyarrow.lib.ChunkedArray object at 0x7fd62046da30>
[
  [
    [
      1,
      2
    ],
    [
      3,
      4
    ]
  ]
]

quinnj commented

Yeah, I agree this isn't ideal. At the time, I thought this was a reasonable way to translate to the arrow fixed size binary data type, but in reality, we should have found a way to limit it to only Base.CodeUnits, like we do now for the list data type. The problem is that we now unequivocally treat Base.CodeUnits as list, so there's no straightforward way to say, "hey, I have a vector of fixed size binary data and want the fixed size binary arrow data type". We could create a wrapper like Arrow.FixedSizeBinary that people would have to use explicitly, but that's a bit annoying. Let me think on this one for just a bit.
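To make the wrapper idea concrete, here is a minimal sketch in plain Julia. It does not use Arrow.jl at all: `FixedSizeBinary` is only the name floated above, and `sketch_arrowkind` is a hypothetical stand-in for whatever the writer would use to pick an arrow type. The point is only that an explicit wrapper lets dispatch tell "tuple of bytes" apart from "fixed size binary".

```julia
# Hypothetical wrapper type; assumed for illustration, not an existing Arrow.jl API.
struct FixedSizeBinary{N}
    bytes::NTuple{N,UInt8}
end

# Convenience constructor from any byte vector (e.g. b"\x01\x02").
FixedSizeBinary(bytes::AbstractVector{UInt8}) =
    FixedSizeBinary{length(bytes)}(Tuple(bytes))

# Hypothetical stand-in for the writer's type decision.
# Plain tuples, including tuples of UInt8, stay fixed size lists.
sketch_arrowkind(::Type{NTuple{N,T}}) where {N,T} = :fixed_size_list
# Only the explicit wrapper opts in to fixed size binary.
sketch_arrowkind(::Type{FixedSizeBinary{N}}) where {N} = :fixed_size_binary

list_kind = sketch_arrowkind(typeof((0x01, 0x02)))
binary_kind = sketch_arrowkind(typeof(FixedSizeBinary(b"\x01\x02")))
```

With this shape, `data1` from the report would keep its list layout, and a user who really wants fixed size binary would have to wrap each element explicitly, which is the annoyance mentioned above.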

In any case, we would probably want to give FixedSizeListKind in ArrowTypes a third type parameter that tracks whether the fixed size data should be binary or not (since we don't want to unequivocally treat a UInt8 eltype as binary, which is the core issue here).
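A rough sketch of what a third type parameter could look like, again in standalone Julia. `SketchFixedSizeListKind`, `sketch_kind`, `isbinary`, and the `binary` keyword are all assumed names for illustration and do not reflect the actual ArrowTypes definitions.

```julia
# Hypothetical kind type: N elements of type T, plus a Bool flag B for "binary".
struct SketchFixedSizeListKind{N,T,B} end

# Read the flag back out of the type parameters.
isbinary(::SketchFixedSizeListKind{N,T,B}) where {N,T,B} = B

# Tuples default to the list layout, even when T is UInt8.
# Binary must be requested explicitly and only makes sense for UInt8 elements.
function sketch_kind(::Type{NTuple{N,T}}; binary::Bool=false) where {N,T}
    if binary && T !== UInt8
        throw(ArgumentError("binary layout requires UInt8 elements"))
    end
    return SketchFixedSizeListKind{N,T,binary}()
end

default_kind = sketch_kind(NTuple{2,UInt8})               # list layout
binary_kind = sketch_kind(NTuple{2,UInt8}; binary=true)   # explicit opt-in
```

Carrying the flag in the type means the element type alone no longer decides the layout, so `NTuple{2,UInt8}` and `NTuple{2,UInt16}` would behave the same unless binary is asked for.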

Moelf commented

Let me know if you want me to try my hand at this one (once you have a design idea).