delta-io/delta-rs

Unit and integration tests for the vacuum command

rtyler opened this issue · 2 comments

Our preliminary support for vacuum would benefit from more unit and integration tests to ensure that it deletes only the right files whenever vacuum is invoked.

danmx commented

Referencing @mrk-its's comment from #97 so it's not lost.

@rtyler @fvaleye It looks like there are still two serious issues with the vacuum implementation:

* vacuum lists all files in the dataset using `StorageBackend.list_objs`. The problem is that this function returns all files (including those in subdirectories) on the `s3` and `gcs` backends (although I'm not sure about gcs). On the `file` and `azure` backends it lists only first-level files (without recursing into subdirectories).

* vacuum completely ignores files that are not referenced by the delta log (that is, files that appear in neither the `DeltaTableState.files()` nor the `DeltaTableState.all_tombstones()` list).
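
To make the two points concrete, here is a rough sketch of the kind of test that would catch both. It does not use the delta-rs API: `list_all_files` and `vacuum_candidates` are hypothetical helpers, and the selection rule (delete anything outside `_delta_log/` that is neither an active file nor a tombstone still within retention) is my assumption about the intended behavior. It also ignores file modification times, which a real implementation would probably need to check before removing unreferenced files.

```python
import os
import tempfile


def list_all_files(root):
    """Recursively list files under root as '/'-separated relative paths.

    Hypothetical stand-in for a storage listing that behaves the same
    way on every backend (always recursing into subdirectories).
    """
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            found.add(rel.replace(os.sep, "/"))
    return found


def vacuum_candidates(listed, active, retained_tombstones):
    """Return the listed paths that the assumed rule would delete.

    Assumption: a path is kept if it is part of the log, an active file,
    or a tombstone still inside the retention window. Everything else,
    including files the log never mentions, is a candidate.
    """
    keep = set(active) | set(retained_tombstones)
    return {
        path
        for path in listed
        if not path.startswith("_delta_log/") and path not in keep
    }


def test_vacuum_selects_nested_and_unreferenced_files():
    layout = [
        "_delta_log/00000000000000000000.json",
        "part-0.parquet",            # active, top level
        "part-3.parquet",            # tombstone still within retention
        "year=2021/part-1.parquet",  # active, in a subdirectory
        "year=2021/part-2.parquet",  # tombstone past retention
        "year=2021/orphan.parquet",  # never referenced by the log
    ]
    with tempfile.TemporaryDirectory() as root:
        for rel in layout:
            full = os.path.join(root, *rel.split("/"))
            os.makedirs(os.path.dirname(full), exist_ok=True)
            open(full, "w").close()

        listed = list_all_files(root)
        # A listing that stops at the first level would miss these.
        assert "year=2021/part-1.parquet" in listed

        candidates = vacuum_candidates(
            listed,
            active={"part-0.parquet", "year=2021/part-1.parquet"},
            retained_tombstones={"part-3.parquet"},
        )
        assert candidates == {
            "year=2021/part-2.parquet",
            "year=2021/orphan.parquet",
        }
```

Running the same layout against each storage backend would show whether the listing is recursive everywhere, and the `orphan.parquet` case covers the unreferenced-file issue.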

This should be well tested at this point, also given the introduction of VACUUM START and VACUUM END. I added a couple of tests on the Python side.