duckdb/dbt-duckdb

Support for delta

Yacobolo opened this issue · 9 comments

Looking forward to support for Delta. This would enable us to run a poor man's data lakehouse! Do you need any help? What is the ETA: this year?

jwills commented

Ack, sorry for the lag here @Yacobolo, I was on the road and missed this going by. I would like to have a plugin that supported Delta akin to the one I have for Iceberg; I'm assuming it would use the deltalake python package, but I personally don't have access to a Delta lake instance and tbh don't really care enough about learning how to set up a real one to do it myself "for fun."

However, if you (or anyone else!) do have a Delta lake instance and you know how it should be configured as a dbt-duckdb plugin, I would most definitely be happy to merge it in.

Hi, @jwills, I would like to try this integration. This would be my first contribution, so I would appreciate some help and guidance at the beginning.

I did a first draft of the read plugin integration here,
and in parallel I am building an example project here where I showcase it.

Here is the source configuration, which loads the data as a source with file and projection pruning

What workflow would work best for you to give feedback?

jwills commented

Hey @milicevica23, thanks so much for taking a crack at this!

The code as-written makes sense to me, but I have to be honest that I don't have a great sense for how folks actually use the deltalake python module in the real world-- like, do folks really use delta tables w/o a catalog? https://delta-io.github.io/delta-rs/python/usage.html#loading-a-delta-table

The nice thing is that you can, but don't have to, use a catalog to know where your table is, and I thought I would implement support for both ways. Or at least try to.
You can think of it as adding a new file format for external files; not everybody who is on-prem or doing simple projects has a catalog. But I would be happy to hear feedback from others.
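
To make the catalog-free case concrete, here is a minimal sketch of opening a Delta table purely by its location with the deltalake python package. The helper names `resolve_table_uri` and `load_delta_table` are made up for illustration and are not part of the draft plugin; the only library call assumed is `DeltaTable(table_uri)`.

```python
from pathlib import Path


def resolve_table_uri(location: str) -> str:
    """Pass remote URIs through unchanged; turn local paths into absolute paths."""
    if "://" in location:
        return location
    return str(Path(location).expanduser().resolve())


def load_delta_table(location: str):
    """Open a Delta table by location only, with no catalog lookup involved."""
    # Imported lazily so the path helper above works without deltalake installed.
    from deltalake import DeltaTable

    return DeltaTable(resolve_table_uri(location))
```

The table directory (with its `_delta_log` folder) is the only thing needed here, which is why this behaves like just another external file format.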

Same here, the main use case is not the catalog, but more the metadata it generates together with the ACID transactions and time travel / change history🔥

jwills commented

Alright, super cool. So @milicevica23 if you would put your change together as a PR and other folks on this thread can weigh in on any additional config options we need to support those use cases, that would be great!

Sure, I will open a draft PR.

The things still to do

  • check/rename the config params (feedback appreciated)
  • add support for time travel to load older versions of the tables
  • add support for at least Azure, because I have access to it; if you have some guidance for a local AWS S3 environment like LocalStack I can try that too, but I would have to learn it first
  • (optional) try to rewrite the reading logic so that we don't have to import first and then process, but can push down the predicates instead. I posted a Stack Overflow question; maybe you can help me do/understand that @jwills? The benefit would be that we don't have to specify filters at the config level; they would come from the first query on the view that points to / holds an instance of the Delta table
  • (optional) try to add a Databricks catalog option. I don't have much experience with that but can learn it along the way
  • do testing after we agree on the structure
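
As a rough sketch of the time travel and pushdown items above: the deltalake package can open a table at a given version and expose it as a PyArrow dataset, and DuckDB can register such a dataset as a view and push projections and filters into it at query time. The function names and the `loader` parameter below are hypothetical, chosen only for illustration; `conn` is assumed to be a DuckDB connection.

```python
from typing import Any, Callable, Optional


def open_delta_dataset(table_uri: str,
                       version: Optional[int] = None,
                       storage_options: Optional[dict] = None) -> Any:
    """Open a Delta table (optionally at an older version) as a lazy PyArrow dataset."""
    # Imported lazily so this module can be loaded without deltalake installed.
    from deltalake import DeltaTable

    table = DeltaTable(table_uri, version=version, storage_options=storage_options)
    return table.to_pyarrow_dataset()


def register_delta_view(conn: Any,
                        view_name: str,
                        table_uri: str,
                        version: Optional[int] = None,
                        storage_options: Optional[dict] = None,
                        loader: Callable[..., Any] = open_delta_dataset) -> None:
    """Register the dataset under view_name; nothing is read until the view is queried."""
    dataset = loader(table_uri, version=version, storage_options=storage_options)
    conn.register(view_name, dataset)
```

With this shape, filters would not need to live in the source config: a later `SELECT ... WHERE ...` against the view is where the pruning happens. Cloud credentials (for example for Azure) would go in through `storage_options`.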

Feel free to add new ideas or topics.

I am not used to the PR process on GitHub, so feel free to rewrite or change things as needed to fit your needs and best practices.

How would the new delta kernel (https://duckdb.org/2024/06/10/delta.html) work here to simplify, and perhaps speed up, access to Delta-based data?

A: Simply add the extension, if your platform is supported: https://duckdb.org/docs/extensions/delta#supported-duckdb-versions-and-platforms
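
For reference, a minimal sketch of what "adding the extension" could look like from Python, assuming the `duckdb` package is installed and the platform is supported. The helper names `delta_scan_sql` and `query_delta` are illustrative only; the extension itself provides the `delta_scan` table function.

```python
def delta_scan_sql(table_uri: str) -> str:
    """Build a query that reads a Delta table through the delta extension."""
    escaped = table_uri.replace("'", "''")  # escape single quotes in the path
    return f"SELECT * FROM delta_scan('{escaped}')"


def query_delta(table_uri: str):
    """Install and load the delta extension, then fetch all rows of the table."""
    # Imported lazily so the SQL helper above works without duckdb installed.
    import duckdb

    conn = duckdb.connect()
    conn.execute("INSTALL delta")
    conn.execute("LOAD delta")
    return conn.execute(delta_scan_sql(table_uri)).fetchall()
```

In a dbt-duckdb project the same `delta_scan('...')` call could presumably be used directly in a model's SQL once the extension is loaded, without going through a Python plugin.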