Writing and Reading Random Access Files
okartal opened this issue · 3 comments
Maybe related to #353
It is already possible to use Tables.partitioner to write record batches to a single Arrow file. However, when I read that file with Arrow.Table I do not know how to access a specific record batch like here: https://arrow.apache.org/docs/java/ipc.html#writing-and-reading-random-access-files
According to the docs, this should be possible, but I am not sure whether it is not implemented yet or simply not documented.
You're right that we don't expose this very well (i.e., at all) via Arrow.Table right now; but using Arrow.Stream gives you back an iterator of Arrow.Table for each record batch. But we could probably also expose a way via Arrow.Table to let you get the individual tables. Something to think about, or at least improve in the docs by mentioning Arrow.Stream.
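A minimal sketch of the Arrow.Stream approach described above, assuming a file written with Tables.partitioner. The column name `a`, the sample data, and the helper `nth_batch` are made up for illustration and are not part of Arrow.jl. Note that the helper still walks the stream from the start, so it is sequential access to the n-th batch, not true random access.

```julia
using Arrow, Tables

# Three partitions, each a column table; Arrow.write emits one record batch per partition.
parts = Tables.partitioner([(a = [1, 2],), (a = [3, 4],), (a = [5, 6],)])
path = tempname() * ".arrow"
Arrow.write(path, parts)

# Arrow.Stream yields one Arrow.Table per record batch.
batches = Arrow.Table[]
for batch in Arrow.Stream(path)
    push!(batches, batch)
end

# Hypothetical helper (not an Arrow.jl function): return the n-th record batch
# by skipping the first n - 1 batches of the stream.
nth_batch(path, n) = first(Iterators.drop(Arrow.Stream(path), n - 1))

batch2 = nth_batch(path, 2)
println(collect(batch2.a))
```

If all batches are needed anyway, keeping the collected `batches` vector and indexing into it avoids re-reading the file for each lookup.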
According to https://arrow.apache.org/docs/python/ipc.html#writing-and-reading-random-access-files, we need to use a seek method to implement random access to a batch.
We don't have to do what the Python implementation says; that's specific to Python. A batch is a well-defined thing in the file format, independent of which implementation we're talking about. It's purely a logical question of how we get there given the schema / metadata, and what the interface for the user should be.