2023 H1 Roadmap
wjones127 opened this issue · 7 comments
Work committed to
These are projects current contributors are working on.
- (P0) Data Acceptance Tests running in CI (@wjones127)
- (P0) Fully protocol-compliant optimistic commit protocol (conflict resolution) (#632) (@roeap)
- (P0) ADBC driver: create / read / append / overwrite (@wjones127)
- Lay the foundation for a DuckDB plugin, more language bindings (R), and cross-language Polars support (R and JavaScript, in addition to Python)
- (P1) Python bindings integrated with ADBC driver (@wjones127)
- ADBC to supersede PyArrow-based reader / writer.
- (P0) Remove experimental marker from Python writer (@wjones127)
- (P0) Writer version 2 support in operation module (@wjones127)
- (TBD) Provide async features in the Python binding (@fvaleye)
- (TBD) Airbyte <> Delta Lake integration (@fvaleye)
- More Rust documentation
- Figure out where to host
- Figure out SEO
- Probably migrate off of github.io
- Blog posts (@MrPowers)
- PyO3 blog post good for Rust audience
- Content for Azure. Developer advocacy arm of Azure is very impressive. They spread this message.
- Usage of the Python module is more compelling
- Kafka-delta-ingest reduced writer cost 25 times. Christian & Tyler co-authors.
- Purge Ruby bindings. They’re not usable.
Projects seeking contributors
In addition to smaller issues labelled good-first-issue, these are some larger projects that we could use some help on. Most of them will be implemented as part of the operations module in the Rust source and can later be exposed to Python and other bindings.
- DELETE operation (#832)
- UPDATE operation (#1126)
- MERGE operation (#850)
- OPTIMIZE operation, which currently only works on append-only tables (#1125)
- Z-order implementation (#1127)
- Optimized Parquet compaction (apache/arrow-rs#1711)
- Optimize VACUUM with bulk requests (#405, apache/arrow-rs#2615)
- Support column mapping (#930)
- Support deletion vector (#1094)
- Create a file caching layer (#769)
This looks great! Really excited!
Some blog post ideas:
- deltalake 0.7.0 post explaining the new features
- Delta Lake + AWS Lambda (from the aws-sdk-pandas work being done by @nkarpov)
- Why delta-rs is switching to ADBC (I think the Rust data community would be interested in this one)
Let me know if I should make issues for the blog posts. I'm fine tracking them elsewhere too. I'll want delta-rs community reviews, but we can just do those in the Slack chat. Thanks for putting this together.
@MrPowers I'm interested in taking on the Delta Lake + AWS Lambda blog post. Can you help me out with the process?
@wjones127 maybe a silly question, but why would you still need the Operations API, which only uses DataFusion (in Rust), after introducing the ADBC API?
From the design document I can see any query engine can potentially be used with ADBC.
Why implement optimize and Z-order when Databricks is going the opposite way with Liquid Clustering? By the time delta-rs implements this, Databricks will have made Liquid Clustering the default.
But they are already implemented in delta-rs.
The delta-rs team actually implemented these two features before the announcement of Delta 3.0 and Liquid Clustering. To be honest, Delta 3.0 and Liquid Clustering came out kind of unexpectedly.