delta-io/delta-rs

Python: Support writer protocol V2 in write_deltalake

wjones127 opened this issue · 3 comments

Description

Most Delta tables now default to writer protocol V2, so our writer isn't compatible with most systems yet.

Two features needed are:

  • Add support for enforcing delta.appendOnly. We just need to read the delta config, check the mode parameter, and throw an error if needed (see the sketch after this list). (#590)
  • Add support for enforcing column invariants. This part is more complex: we need to parse each SQL expression into a PyArrow expression, evaluate it against the data, and raise an error as soon as invalid data is found. (#592)
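
For the appendOnly half, here is a minimal sketch of what the check might look like. It assumes the table config is reachable via DeltaTable.metadata().configuration (which the Python bindings expose today); check_append_only itself, and where it would be called from, are hypothetical:

from deltalake import DeltaTable

# Hypothetical helper; the real check would live inside write_deltalake.
def check_append_only(table: DeltaTable, mode: str) -> None:
    config = table.metadata().configuration or {}
    append_only = config.get("delta.appendOnly", "false").lower() == "true"
    if append_only and mode == "overwrite":
        # Overwriting deletes existing rows, which delta.appendOnly forbids.
        raise ValueError(
            "Cannot write in mode 'overwrite': table is configured as "
            "append-only (delta.appendOnly = true)."
        )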

Related Issue(s)
Umbrella issue: #542

@GraemeCliffe-inspirato Are you still interested in taking the append-only part of this issue? If not, I may be able to look into it. If so, I'll hold off and see if there are other good first issues for me on this :)

@PadenZach I'm still interested in the appendOnly section! Thank you for checking! I've just now gotten the project building and the tests running, so I'm starting on the issue.

My initial guess at what needs to be done for invariants is something like this:

from typing import Iterator, List

from pyarrow import RecordBatch

# ...inside of write_deltalake
def iter_batches(data, invariants: List[str]) -> Iterator[RecordBatch]:
    for batch in data:
        # Check every invariant against this batch before yielding it.
        for sql_clause in invariants:
            if not execute(sql_clause, batch):
                raise ValueError(f"Invariant violated: {sql_clause}")
        yield batch

invariants = configuration["delta.invariants"]
data = convertToRecordBatchReader(data)

batch_iter = iter_batches(data, invariants)
# ... pass batch_iter to write_dataset

The hard part is that we don't have a SQL parser in the deltalake package, so I'm not sure how that execute() function would work. One option is to turn on delta-rs's datafusion feature in the Python build (which I suspect we'll do eventually anyway) and implement execute() in Rust using datafusion. Moving the record batch into Rust temporarily with zero copies should be possible (python-datafusion does exactly that), but it might take a little glue code.
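
As a rough illustration of that route, here is what execute() might look like if prototyped against the standalone datafusion Python package rather than in Rust (SessionContext and register_record_batches come from datafusion-python; the NULL semantics required by the spec are glossed over):

import pyarrow as pa
from datafusion import SessionContext

def execute(sql_clause: str, batch: pa.RecordBatch) -> bool:
    """Return True iff the invariant holds for every row of the batch."""
    ctx = SessionContext()
    # Registering hands the Arrow buffers over to Rust without copying them.
    ctx.register_record_batches("t", [[batch]])
    result = ctx.sql(
        f"SELECT count(*) FROM t WHERE NOT ({sql_clause})"
    ).collect()
    # Note: rows where the clause evaluates to NULL are not counted here;
    # whether those should count as violations needs a decision.
    return result[0].column(0)[0].as_py() == 0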

It's probably worth researching what invariants are typically used and allowed by existing engines. The spec is very vague, but the Spark implementation likely supports only a limited set of column types and operations that we need to care about.
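
If that set really does turn out to be small, say plain column-vs-literal comparisons like x > 3, a pure-PyArrow evaluation could sidestep the SQL-engine dependency entirely. A toy sketch under that assumption (check_simple_invariant and its one-line "parser" are hypothetical):

import pyarrow as pa
import pyarrow.compute as pc

# Toy fallback that only understands "<column> <op> <numeric literal>";
# anything richer would still need a real SQL parser.
_OPS = {
    ">": pc.greater, ">=": pc.greater_equal,
    "<": pc.less, "<=": pc.less_equal, "=": pc.equal,
}

def check_simple_invariant(sql_clause: str, batch: pa.RecordBatch) -> bool:
    column, op, literal = sql_clause.split()
    values = batch.column(batch.schema.get_field_index(column))
    mask = _OPS[op](values, float(literal))
    # pc.all() skips nulls, so null rows pass silently here; the spec's
    # null semantics would need a deliberate choice.
    return pc.all(mask).as_py() is True

For example, check_simple_invariant("x > 3", batch) returns False as soon as the batch contains a row with x <= 3.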