Implement vacuum command
xianwill opened this issue · 5 comments
xianwill commented
delta-rs should have a "vacuum table" utility analogous to the one provided by the open source Spark Delta Lake implementation. This utility is useful for cleaning up old files that are no longer referenced by the delta log (e.g. files rewritten by merge statements, the optimize command, etc.).
See the VacuumCommand in the open source implementation for reference.
rtyler commented
@fvaleye I think this is actually done, right? I'm not clear on what work we have left to do.
fvaleye commented
MironAtHome commented
mrk-its commented
@rtyler @fvaleye It looks like there are still two serious issues with the vacuum implementation:
- vacuum lists all files in the dataset using `StorageBackend.list_objs`. The problem is that this function returns all files (including those in subdirectories) on the `s3` and `gcs` backends (although I'm not sure about gcs). On the `file` and `azure` backends this function lists only first-level files (without recursing into subdirectories).
- vacuum ignores files that are not referenced by the delta log at all (i.e. files that appear in neither the `DeltaTableState.files()` nor the `DeltaTableState.all_tombstones()` list).
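
To make the two points above concrete, here is a minimal sketch in plain Rust using only the standard library, not the delta-rs code itself. The names `list_files_recursive` and `vacuum_candidates` and the `keep` set are hypothetical. The sketch assumes a local filesystem, treats `keep` as the set of paths that must survive (for example active files plus tombstones still inside the retention window), and leaves out the retention check entirely.

```rust
use std::collections::HashSet;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper: collect every file under `root`, descending into
/// subdirectories, and return the paths relative to `root` in sorted order.
fn list_files_recursive(root: &Path) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    // Explicit stack of directories still to visit (avoids recursion).
    let mut pending = vec![root.to_path_buf()];
    while let Some(dir) = pending.pop() {
        for entry in fs::read_dir(&dir)? {
            let path = entry?.path();
            if path.is_dir() {
                pending.push(path);
            } else if let Ok(rel) = path.strip_prefix(root) {
                found.push(rel.to_path_buf());
            }
        }
    }
    found.sort();
    Ok(found)
}

/// Hypothetical helper: everything that was listed, is not in `keep`, and
/// does not live under `_delta_log` is a deletion candidate. This includes
/// files the delta log never referenced at all.
fn vacuum_candidates(listed: &[PathBuf], keep: &HashSet<PathBuf>) -> Vec<PathBuf> {
    listed
        .iter()
        .filter(|p| !p.starts_with("_delta_log"))
        .filter(|p| !keep.contains(*p))
        .cloned()
        .collect()
}
```

Starting from the storage listing rather than from the tombstone list is what lets unreferenced files show up as candidates. It also means the listing has to behave the same way (recursively) on every backend.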