Update delta-kernel to at least 0.4.0 to leverage a lazy `scan.execute` for large tables
Currently, delta-kernel loads the entire table into memory, which causes memory pressure and out-of-memory failures when dealing with large enough data.
From delta-kernel 0.4.0, `scan.execute` is lazy and only loads data as the iterator is consumed.
I have a fork of your Python library where I don't load the entire dataset into a `pyarrow.Table` but instead work on each `RecordBatch` separately, for memory efficiency. This is currently pointless because delta-kernel loads eagerly in the supported 0.2.x versions.
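To illustrate the difference, here is a minimal stdlib-only sketch of eager materialization versus batch-wise lazy consumption. `scan_batches` is a hypothetical stand-in for a lazy `scan.execute`, not the real delta-kernel API:

```python
from typing import Iterator, List

def scan_batches(num_batches: int, batch_size: int) -> Iterator[List[int]]:
    """Hypothetical stand-in for a lazy scan.execute():
    yields one batch at a time instead of materializing the table."""
    for b in range(num_batches):
        yield list(range(b * batch_size, (b + 1) * batch_size))

# Eager: the whole "table" is resident in memory at once.
table = [row for batch in scan_batches(3, 4) for row in batch]

# Lazy: only one batch is alive per loop iteration, so peak memory
# is bounded by the batch size, not the table size.
total = 0
for batch in scan_batches(3, 4):
    total += sum(batch)
```

The same pattern applies to `pyarrow`: consuming `RecordBatch`es one at a time instead of collecting them into a `Table` keeps peak memory proportional to a single batch.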
Supporting 0.4.x would open up a lot of possibilities for large data processing.
Any short-term plans to support delta-kernel 0.4.x?
I'm new to Rust and sadly couldn't make it work myself.
Relevant bit of the changelog:

> `Scan`'s `execute(..)` method now returns a lazy iterator instead of materializing a `Vec<ScanResult>`
Hi, in release 1.3.1 of delta-sharing Python, we upgraded to delta-kernel-rs 0.6.0, and `scan.execute` is now lazy. Lazy loading actually wasn't possible in the Rust wrapper until 0.6.0, due to lifetime requirements on `scan.execute`.
Thanks!!