databricks/delta-live-tables-notebooks

Could you please explain the fundamental difference between dlt.read and dlt.readStream?

Trodenn opened this issue · 1 comment

Our team would like to set up a DLT pipeline to achieve real-time performance in data collection and analysis. We have a Kafka connection set up with readStream, as in your examples.

The first DLT table we have returns the above-mentioned Kafka stream (let us call it the "bronze_table").
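For context, the bronze table looks roughly like the following sketch (the broker address and topic name are placeholders, and spark is the session DLT provides in the pipeline notebook):

```python
import dlt
from pyspark.sql.functions import col

@dlt.table(name="bronze_table", comment="Raw events ingested from Kafka")
def bronze_table():
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "kafka-broker:9092")  # placeholder broker
        .option("subscribe", "events")                           # placeholder topic
        .option("startingOffsets", "earliest")
        .load()
        # Kafka delivers key/value as binary; cast to strings for downstream use
        .select(
            col("key").cast("string"),
            col("value").cast("string"),
            col("timestamp"),
        )
    )
```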

Now further downstream we have other tables that need to read from that bronze_table.

  1. If I use dlt.readStream("bronze_table")
  • Does my second table, which reads from bronze_table, only ingest the newly added data, ignoring the older data that was already there as defined by .option("startingOffsets", "earliest")? (i.e., will I still get the earliest offset with this read mode?)
  2. If I use dlt.read("bronze_table")
  • If I run with this command instead, will my second DLT table read the entirety of bronze_table, from the earliest offset up to the most recent one at the time the pipeline runs, and then cease to update? (I know there is CDC, but honestly I am not sure when to use it; the same applies to the SCD that is involved. If anyone can give an explanation of that too, that would be perfect.)
  • When the pipeline runs in "continuous" mode instead of triggered mode, does that mean the second Delta table that uses dlt.read (in scenario 2) will keep updating with the newest data from bronze_table?

Please, there is really not enough documentation about all this in general; I would appreciate any kind of feedback on this matter.
If you need more info, I can provide that too.

dlt.read reads the table as a batch DataFrame: the source is read in its entirety on every run, and the result of the query is fully materialized in the target table.
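For example, a fully materialized downstream table might look like this (a minimal sketch; silver_snapshot and the filter are hypothetical):

```python
import dlt

@dlt.table(name="silver_snapshot")
def silver_snapshot():
    # Batch read: bronze_table is re-read in full and this table's
    # contents are recomputed from scratch on every pipeline update.
    return dlt.read("bronze_table").where("value IS NOT NULL")
```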

dlt.readStream reads the source incrementally as a streaming data source, checkpoints its progress, and appends new rows to the target table.
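The streaming counterpart would look like this (again a sketch, with a hypothetical table name). DLT manages the checkpoint for you, so each update processes only the rows added to bronze_table since the previous update:

```python
import dlt

@dlt.table(name="silver_incremental")
def silver_incremental():
    # Streaming read: only new rows in bronze_table are processed and
    # appended; progress is tracked in a DLT-managed checkpoint.
    return dlt.readStream("bronze_table").where("value IS NOT NULL")
```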