Feature: Add support for scanning parquet from cloud storage / S3
catkins opened this issue · 12 comments
(let me know if this is already possible with extra existing plumbing)
The Python bindings for polars can already do the more optimised byte-range trickery for working with parquet files in S3, and they also support scanning directories of parquet files in S3.
I'm also happy to help contribute this at some point if you think it'd be worthwhile.
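For readers unfamiliar with the "byte-range trickery" mentioned above, here is a rough sketch of the idea, not how polars implements it. A Parquet file ends with its metadata, a 4-byte little-endian metadata length, and the magic bytes PAR1, so a reader can find the footer with two small ranged reads instead of downloading the whole object. The RangeReadable class and read_parquet_footer method are hypothetical names made up for this illustration; with S3 each read_range call would correspond to a GET request with a Range header.

```ruby
# Hypothetical stand-in for a remote object that only supports ranged reads.
# It is not a Polars or AWS API; it just records which ranges were requested.
class RangeReadable
  attr_reader :requests

  def initialize(bytes)
    @bytes = bytes.b
    @requests = []
  end

  def size
    @bytes.bytesize
  end

  # Returns `length` bytes starting at `offset` and records the request.
  def read_range(offset, length)
    @requests << [offset, length]
    @bytes.byteslice(offset, length)
  end
end

PARQUET_MAGIC = "PAR1".b

# Fetches the raw footer (file metadata) bytes using two ranged reads.
# File layout at the tail: <metadata> <4-byte little-endian length> "PAR1".
def read_parquet_footer(source)
  tail = source.read_range(source.size - 8, 8)
  raise ArgumentError, "not a Parquet file" unless tail.byteslice(4, 4) == PARQUET_MAGIC

  footer_length = tail.byteslice(0, 4).unpack1("V")
  footer_offset = source.size - 8 - footer_length
  source.read_range(footer_offset, footer_length)
end

# Usage with fake data: the "metadata" here is a placeholder string,
# not real Thrift-encoded Parquet metadata.
fake_metadata = "metadata-bytes".b
fake_file = "PAR1".b + ("x" * 1000).b + fake_metadata +
            [fake_metadata.bytesize].pack("V") + "PAR1".b
source = RangeReadable.new(fake_file)
footer = read_parquet_footer(source)
puts "fetched #{source.requests.sum { |_, len| len }} of #{source.size} bytes"
```

Once the footer is decoded, a reader knows the offsets of each row group and column chunk and can request only the ranges a query needs, which is what makes lazy scanning of remote files cheap.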
Awesome, I'll see how I go.
I'm looking at the relevant bit of py-polars, so I'll see if I can plumb it through in a similar way.
https://github.com/pola-rs/polars/blob/main/py-polars/src/lazyframe.rs/#L258-L313
Looking at it more, it'd require including a TLS library in Rust (for HTTPS connections), which isn't something I'd like to do right now, so I think it'd be better to try to do this outside of the gem with the Ruby S3 client.
To clarify, I grok how the TLS lib is getting pulled in:
- adding the aws feature to the polars.dependencies
- adds in reqwest as specified by the polars-io crate
- which in turn pulls in the TLS library
> it'd be better to try to do this outside of the gem with the Ruby S3 client
So that was actually my first approach, but I got tripped up on LazyFrame#new_from_parquet only accepting a file path, with no way to pass in some kind of IO type.
Happy to hear if I was barking up the wrong tree though...
I'm most of the way through the first part of plumbing in the aws feature and wiring it through to lazyframe locally, so I can throw up a fork to look at and chat about after I get it more solid.
Failing that, I guess my fallback for doing what I was hoping to do was using the red-arrow gems, but polars-ruby was much more user friendly to use and install.
To satisfy my own curiosity, I did a rough-cut PR implementing it.
If you're not keen on supporting it at this time, that's understandable too.
I would also enjoy this addition if @ankane would consider it. (Can you tell I'm spoiled by DuckDB?)
We're considering doing this exact thing; would be awesome if reliable support for this feature was added!
We've got a few use cases that having this would simplify for us too, so I'd love to have it in the library and leveraging the rust implementations.
Good ol' "reverse ETL"; same situation here
@ankane at work we also tried to do this outside of this gem but got stuck on the same limitation as @catkins: the scan_parquet method does not accept IO objects.
We can get around it by using a Tempfile, but then there is no way to do partial reads from S3 in order to process larger-than-memory datasets (or just to improve performance by limiting network usage).
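A minimal sketch of the Tempfile workaround described above, using only the Ruby standard library. The with_local_copy helper is a hypothetical name for this illustration. In real use the IO would be the body of an S3 download and the block would hand the path to scan_parquet; both are assumptions here and are replaced by a StringIO and a plain file read so the example runs on its own.

```ruby
require "tempfile"
require "stringio"

# Copies an IO-like object to a temporary file and yields the file's path.
# The temporary file is removed as soon as the block returns, so any lazy
# query against the path has to be collected inside the block.
def with_local_copy(io, suffix: ".parquet")
  Tempfile.create(["download", suffix]) do |file|
    file.binmode
    IO.copy_stream(io, file)
    file.flush
    yield file.path
  end
end

# Stand-in for a downloaded object body (assumption: real code would pass
# the S3 response body and call scan_parquet on `path` instead).
body = StringIO.new("pretend parquet bytes")
bytes_on_disk = with_local_copy(body) { |path| File.size(path) }
puts "copied #{bytes_on_disk} bytes to a temporary file"
```

This shows the limitation raised in the comment: the whole object is copied to local disk before anything can be scanned, so there is no opportunity for partial reads.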
I'm of the same opinion as others that this gem should match the python library in terms of functionality and therefore that we should add support for connecting to S3 directly in this gem.
Hi all, this will be included in 0.15.0 (but I don't have a timeline yet).