Writing and Reading Random Access Files

Question

okartal opened this issue 2 years ago · 3 comments

Maybe related to #353

It is already possible to use Tables.partitioner to write multiple record batches to a single Arrow file. However, when I read that file with Arrow.Table, I do not know how to access a specific record batch, as described here: https://arrow.apache.org/docs/java/ipc.html#writing-and-reading-random-access-files

According to the docs, this should be possible, but I am not sure whether it is not implemented yet or simply not documented.

Answer 1 · 2023-05-23T00:06:27.000Z

You're right that we don't expose this very well (i.e., at all) via Arrow.Table right now. However, Arrow.Stream gives you back an iterator that yields one Arrow.Table per record batch. We could probably also expose a way via Arrow.Table to let you get the individual tables. Something to think about, or at the very least we should improve the docs by mentioning Arrow.Stream.
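A minimal sketch of the Arrow.Stream approach, assuming Arrow.jl and Tables.jl are installed. The column names, the values, and the temporary file path are made up for illustration. Note that this steps through the batches in order and is not a footer-based seek to a given batch.

```julia
using Arrow, Tables

# Three small column tables with the same schema (illustrative data);
# each partition is written as its own record batch.
parts = [(id = collect(i:i+2), val = collect(i:i+2) .* 0.5) for i in (1, 4, 7)]

path = tempname() * ".arrow"
Arrow.write(path, Tables.partitioner(parts))

# Iterating an Arrow.Stream yields one Arrow.Table per record batch.
sizes = Int[]
for batch in Arrow.Stream(path)
    push!(sizes, length(batch.id))
end

# To reach a particular batch, skip the ones before it. A fresh
# Arrow.Stream is opened here so that no iteration state is reused.
batch2 = first(Iterators.drop(Arrow.Stream(path), 1))
ids2 = collect(batch2.id)
```

Skipping with Iterators.drop keeps the example short; for files with many batches, a direct lookup by index would need support inside Arrow.jl itself, which is what this issue asks about.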

Answer 2 · 2023-05-29T21:51:11.000Z

According to https://arrow.apache.org/docs/python/ipc.html#writing-and-reading-random-access-files, we need to use a seek method to implement random access to a batch.

Answer 3 · 2023-05-29T21:55:21.000Z

We don't have to do what the Python implementation says; that's specific to Python. A batch is a well-defined thing in the file format, independent of which implementation we're talking about. It's purely a logical question of how we get there given the schema and metadata, and of what the interface for the user should be.