Add Parquet support
This will be substantially more challenging than the CSV format, but it could be rewarding. We want to add Apache Parquet support because Parquet is widely used in real-world data science applications.
One challenge is that we currently require a byte pointer to a specific record, whereas Parquet is columnar: the values that make up a single record are stored in separate column chunks at different locations in the file.
https://github.com/apache/parquet-format
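To make the mismatch concrete, here is a toy sketch (not real Parquet: it assumes fixed-width fields and ignores pages, encoding, and compression). The schema, `rowRange`, and `columnarRanges` are made up for illustration. It shows that a row-oriented record is one contiguous byte range, while the same record in a column-oriented layout needs one range per column.

```typescript
// Toy illustration only: fixed-width fields, no Parquet pages/encoding/compression.
type ByteRange = { start: number; length: number };

// Hypothetical schema: an 8-byte id followed by a 16-byte name.
const FIELD_WIDTHS = [8, 16];
const RECORD_WIDTH = FIELD_WIDTHS.reduce((a, b) => a + b, 0);

// Row-oriented layout: record i is a single contiguous range.
function rowRange(i: number): ByteRange[] {
  return [{ start: i * RECORD_WIDTH, length: RECORD_WIDTH }];
}

// Column-oriented layout: all values of field 0, then all values of field 1.
// Record i needs one range per column.
function columnarRanges(i: number, recordCount: number): ByteRange[] {
  const ranges: ByteRange[] = [];
  let columnStart = 0;
  for (const width of FIELD_WIDTHS) {
    ranges.push({ start: columnStart + i * width, length: width });
    columnStart += recordCount * width;
  }
  return ranges;
}
```

A single byte pointer is enough for `rowRange`, but `columnarRanges` returns as many ranges as there are columns, which is the part our current index can't express.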
We will likely need to write our own Parquet file parser to determine the correct byte offsets, and then be particular in the js library about exactly how each record is fetched and parsed. This may require returning additional metadata beyond the byte offset, which we can do via an intermediate pointer in the index.
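As a starting point for discussion, here is a rough sketch of what such an intermediate pointer might carry. Every name here (`RecordPointer`, `ColumnPageRef`, `spansToFetch`, `rangeHeader`) is hypothetical and not part of the existing index or js library. It also assumes flat, non-null columns, so that a row's position within a page lines up with its value index; nested or optional columns would need more metadata than this.

```typescript
// Hypothetical shapes; none of these names exist in the current index or js library.
interface ColumnPageRef {
  column: string;     // column name
  pageOffset: number; // byte offset of the data page within the file
  pageLength: number; // size of the page in bytes
  rowInPage: number;  // which decoded value in the page belongs to the record
}

interface RecordPointer {
  rowGroup: number;
  pages: ColumnPageRef[];
}

interface ByteSpan {
  start: number;
  end: number; // exclusive
}

// Merge the pages a record needs into the fewest contiguous spans,
// so adjacent or overlapping pages are fetched with one request.
function spansToFetch(pointer: RecordPointer): ByteSpan[] {
  const spans = pointer.pages
    .map((p) => ({ start: p.pageOffset, end: p.pageOffset + p.pageLength }))
    .sort((a, b) => a.start - b.start);
  const merged: ByteSpan[] = [];
  for (const span of spans) {
    const last = merged[merged.length - 1];
    if (last !== undefined && span.start <= last.end) {
      last.end = Math.max(last.end, span.end);
    } else {
      merged.push({ ...span });
    }
  }
  return merged;
}

// HTTP Range headers use an inclusive end position.
function rangeHeader(span: ByteSpan): string {
  return `bytes=${span.start}-${span.end - 1}`;
}
```

The js library would then issue one ranged request per span, decode each page, and pick out `rowInPage` from each column to reassemble the record. Whether the pointer should reference pages, column chunks, or whole row groups is exactly the kind of trade-off worth discussing first.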
Anyway, let's talk about this one before working on it. It'll be super educational about how Parquet works, but I don't want us to get lost in the complexity.