delta-io/delta-rs

Refactor writer implementations

wjones127 opened this issue · 1 comment

Description

For further progress on the writer, we'll need to implement features that require query engines. This includes:

  • Higher writer protocols:
    • V2: column invariants (#592)
    • V3: CHECK constraints
    • V4: generated columns
  • Write types that require rewriting files:
    • DELETE
    • UPDATE
    • MERGE (#850)

We can provide a default implementation with DataFusion, but we will also have users that wish to plug in their own query engine. In addition, we may have users that wish to use their own Parquet writer (for distributed engines, for example). So we will likely want to refactor into three distinct layers:

  1. A transaction layer for those who want to use their own Parquet writer to handle data writes (you write data; we write transaction);
  2. A parameterized writer layer for those who want to use their own query engine but will use the built-in data writer (you verify data; we write data and transaction);
  3. A DataFusion-based writer that handles everything (verification, writing, transaction).

I'm not sure how viable this is yet, and would welcome feedback from others.
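To make the layering concrete, here is a minimal sketch of how the three layers might compose. All names here (`TransactionLayer`, `QueryEngine`, `DeltaWriter`, `Batch`, `Action`) are illustrative placeholders with toy types, not the actual delta-rs API:

```rust
/// Placeholder for a record batch of data to write.
#[derive(Clone)]
struct Batch(Vec<i64>);

/// Placeholder for a Delta log action (e.g. an `Add` file entry).
#[derive(Debug, PartialEq)]
struct Action(String);

/// Layer 1: transaction only. The caller writes the Parquet files
/// themselves and hands back log actions; we commit the transaction.
trait TransactionLayer {
    fn commit(&mut self, actions: Vec<Action>) -> usize;
}

/// Layer 2: verification is pluggable (a query engine checks
/// invariants/constraints), while the built-in data writer is used.
trait QueryEngine {
    fn verify(&self, batch: &Batch) -> Result<(), String>;
}

/// Layer 3: a full writer composed from the lower layers.
struct DeltaWriter<E: QueryEngine> {
    engine: E,
    log: Vec<Action>,
}

impl<E: QueryEngine> TransactionLayer for DeltaWriter<E> {
    fn commit(&mut self, actions: Vec<Action>) -> usize {
        self.log.extend(actions);
        // Pretend the log length is the new table version.
        self.log.len()
    }
}

impl<E: QueryEngine> DeltaWriter<E> {
    /// Verify with the engine, "write" the data, then commit.
    fn write(&mut self, batch: Batch) -> Result<usize, String> {
        self.engine.verify(&batch)?;
        let action = Action(format!("add part-{}.parquet", self.log.len()));
        Ok(self.commit(vec![action]))
    }
}

/// Toy engine enforcing a column invariant: all values non-negative.
struct NonNegativeEngine;

impl QueryEngine for NonNegativeEngine {
    fn verify(&self, batch: &Batch) -> Result<(), String> {
        if batch.0.iter().all(|v| *v >= 0) {
            Ok(())
        } else {
            Err("invariant violated: negative value".to_string())
        }
    }
}

fn main() {
    let mut writer = DeltaWriter { engine: NonNegativeEngine, log: vec![] };
    let version = writer.write(Batch(vec![1, 2, 3])).unwrap();
    println!("committed version {}", version);
    // A batch violating the invariant is rejected before any write.
    assert!(writer.write(Batch(vec![-1])).is_err());
}
```

The point of the split is that a distributed engine could implement only `TransactionLayer`, a custom engine only `QueryEngine`, and the DataFusion path would supply all three.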

Use Case

Related Issue(s)

houqp commented

A transaction layer for those who want to use their own Parquet writer to handle data writes (you write data; we write transaction);

If the trait only abstracts out the parquet io logic, I don't think we should call it the transaction layer? Because a transaction in delta contains more than just the data and checkpoint parquets.

I do think we should make the parquet implementation swappable through traits though, so that we can serve both arrow-rs and parquet2/arrow2 users.
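A swappable Parquet backend could be as simple as a trait that the writer is generic over. This is only a sketch with stand-in types; the real trait would take Arrow record batches and the backend names are hypothetical:

```rust
/// Hypothetical trait abstracting the Parquet serialization backend.
trait ParquetBackend {
    /// Serialize a batch of rows into an in-memory Parquet file body.
    fn write_batch(&self, rows: &[i64]) -> Vec<u8>;
}

// Stand-ins for an arrow-rs backend and a parquet2/arrow2 backend.
struct ArrowRsBackend;
struct Parquet2Backend;

impl ParquetBackend for ArrowRsBackend {
    fn write_batch(&self, rows: &[i64]) -> Vec<u8> {
        // Toy serialization: little-endian bytes instead of real Parquet.
        rows.iter().flat_map(|v| v.to_le_bytes()).collect()
    }
}

impl ParquetBackend for Parquet2Backend {
    fn write_batch(&self, rows: &[i64]) -> Vec<u8> {
        rows.iter().flat_map(|v| v.to_le_bytes()).collect()
    }
}

/// The writer only depends on the trait, so the backend is chosen
/// by the user at compile time.
fn write_with<B: ParquetBackend>(backend: &B, rows: &[i64]) -> usize {
    backend.write_batch(rows).len()
}

fn main() {
    println!("{}", write_with(&ArrowRsBackend, &[1, 2, 3]));
}
```

Making the writer generic (rather than using a trait object) keeps the choice zero-cost, at the price of monomorphizing the writer per backend.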

The query engine trait abstraction makes total sense to me 👍

A DataFusion-based writer that handles everything (verification, writing, transaction).

Should the writer be implemented using the query engine trait? Then we would only need to implement the query engine trait for DataFusion.

Overall, I think this looks like a good attack plan. It looks like we are on a good path towards a full-fledged native DeltaLake implementation!