
Update delta-kernel to at least 0.4.0 to leverage a lazy `scan.execute` for large tables


Currently, delta-kernel loads the entire table into memory, which causes all sorts of problems once the data gets large enough.
As of delta-kernel 0.4.0, `scan.execute` is lazy and only loads data as the iterator is consumed.
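
As a rough illustration of the difference, here's a minimal Python sketch; `scan_execute` is a hypothetical stand-in for the kernel binding, not the real API:

```python
import pyarrow as pa

def scan_execute():
    """Hypothetical lazy scan: yields one RecordBatch at a time."""
    for i in range(1000):
        yield pa.RecordBatch.from_pydict({"id": list(range(i * 10, (i + 1) * 10))})

# Eager (pre-0.4.0 behavior): the whole table is materialized at once,
# so peak memory grows with the size of the table.
table = pa.Table.from_batches(scan_execute())

# Lazy (0.4.0+ behavior): batches are produced only as the iterator is
# consumed, so peak memory stays around the size of a single batch.
for batch in scan_execute():
    ...  # process one RecordBatch, then let it be garbage-collected
```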

I have a fork of your Python library that, for memory efficiency, doesn't load the entire dataset into a `pyarrow.Table` but instead processes each `RecordBatch` separately. This is currently pointless, since delta-kernel loads eagerly in the supported 0.2.x versions.
Supporting 0.4.x would open up a lot of possibilities for large-scale data processing.
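
For reference, here's a minimal sketch of the batch-wise pattern I mean, streaming batches straight to a Parquet file so only one batch is resident at a time (`read_batches` is a hypothetical stand-in for the fork's reader; the `ParquetWriter` part is standard pyarrow):

```python
import pyarrow as pa
import pyarrow.parquet as pq

def read_batches():
    """Hypothetical: yields RecordBatches from a shared Delta table."""
    for i in range(100):
        yield pa.RecordBatch.from_pydict({"x": list(range(i, i + 5))})

batches = read_batches()
first = next(batches)

# Write each batch to disk as it arrives, instead of collecting
# everything into a pyarrow.Table first.
with pq.ParquetWriter("out.parquet", first.schema) as writer:
    writer.write_batch(first)
    for batch in batches:
        writer.write_batch(batch)
```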

Are there any short-term plans to support delta-kernel 0.4.x?
I'm new to Rust and sadly couldn't make it work myself.

Relevant bit of the changelog:

> Scan's `execute(..)` method now returns a lazy iterator instead of materializing a `Vec<ScanResult>`

source: https://github.com/delta-incubator/delta-kernel-rs/blob/bd2ea9f2fa44d8bc559659e53d38374309ecf63a/CHANGELOG.md#v040-2024-10-23

Hi, in release 1.3.1 of delta-sharing Python we upgraded to delta-kernel-rs 0.6.0, and `scan.execute` is lazy now. Lazy loading actually wasn't possible to implement in the Rust wrapper before 0.6.0 due to lifetime requirements on `scan.execute`.

Thanks!!