delta-io/delta-rs

Implement vacuum command

xianwill opened this issue · 5 comments

delta-rs should have a "vacuum table" utility analogous to the one provided by the open source Spark Delta Lake implementation. This utility is useful for cleaning up old files that are no longer referenced by the delta log (e.g. files rewritten by merge statements, the optimize command, etc.).

See the VacuumCommand in the open source implementation for reference.

@fvaleye I think this is actually done, right? I'm not clear on what work we have left to do.

Yes, it is already implemented! Hmm, we still need to improve the test suite: #227

How about fixing the documentation?
It's probably worth updating it from this
image
to this?
image

@rtyler @fvaleye It looks like there are still two serious issues with the vacuum implementation:

  • vacuum lists all files in the dataset using StorageBackend.list_objs. The problem is that this function behaves differently across backends: on the s3 and gcs backends it returns all files, including those in subdirectories (although I'm not sure about gcs), while on the file and azure backends it lists only first-level files, without recursing into subdirectories.
  • vacuum completely ignores files that are not referenced by the delta log at all, i.e. files that appear in neither the DeltaTableState.files() list nor the DeltaTableState.all_tombstones() list.
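
To make the two points above concrete, here is a rough Python sketch of the behaviour I would expect. It is not the delta-rs code: `list_files_recursive` and `vacuum_candidates` are hypothetical names, the `protected_tombstones` argument assumes that some tombstones must still be kept (as with the retention period in the Spark implementation), and skipping everything under `_delta_log/` is also my assumption.

```python
import os


def list_files_recursive(root):
    """List every file under `root`, including files in subdirectories.

    Paths are returned relative to `root` with '/' separators, so a local
    backend reports the same keys as an object store that lists by prefix.
    """
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            found.add(rel.replace(os.sep, "/"))
    return found


def vacuum_candidates(listed_files, active_files, protected_tombstones):
    """Return the files vacuum could delete, in sorted order.

    A file is a candidate if it exists on storage, is not an active table
    file, and is not a tombstone that must still be kept. Files the log
    never mentions therefore become candidates too. The log directory
    itself is never touched (assumption).
    """
    keep = set(active_files) | set(protected_tombstones)
    return sorted(
        path
        for path in listed_files
        if path not in keep and not path.startswith("_delta_log/")
    )
```

With a listing that recurses, a partitioned file such as `year=2021/orphan.parquet` that the log never mentions shows up in `listed_files` and ends up in the candidate list, which covers both issues at once.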

Resolved by #669.