dbt-labs/dbt-core

Model freshness should be a first-class citizen, like source freshness

joellabes opened this issue · 2 comments

Describe the feature

We have a model that is built on top of ~50 source tables, each representing a single month of usage. Instead of adding each of those input tables as sources, we use the `dbt_utils.get_relations_by_pattern` and `dbt_utils.union_relations` macros.
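For reference, the model is built roughly like this (the schema and table patterns here are illustrative, not our real names):

```sql
-- models/utcs_base_unioned.sql (sketch; schema/table patterns are made up)
{% set monthly_relations = dbt_utils.get_relations_by_pattern(
    schema_pattern='usage',
    table_pattern='usage_%'
) %}

{{ dbt_utils.union_relations(relations=monthly_relations) }}
```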

This means that the model springs out fully formed instead of having source lineage.
[screenshot: DAG showing utcs_base_unioned with no upstream source lineage]

The best (and only) way to do freshness checks on a model is a `dbt_utils.recency` test, but that can be swallowed up in the background noise of daily errors, and it isn't treated as a freshness issue in metadata tiles. We just found out that Fivetran hadn't been syncing the September table, a week in 😬. If it had surfaced as a freshness issue, we would have been all over it much earlier.
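For the record, that test looks something like this (the column name is illustrative):

```yaml
version: 2

models:
  - name: utcs_base_unioned
    tests:
      - dbt_utils.recency:
          datepart: day
          field: usage_date   # illustrative column
          interval: 1
```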

We could define `utcs_base_unioned` as a source and put freshness expectations on it, but that would lead to a funny-looking DAG (to say nothing of the fact that each user has their own copy of that model, unlike sources, which are a shared resource).

Instead, what I want to be able to do is define freshness rules for a model that behave the same way as sources' freshness. I doubt they'd be used much outside of dynamically generated models like this one, but it's a use case that slips through the cracks right now.
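Concretely, I'm imagining the same shape as source freshness, just attached to a model. This is hypothetical syntax that doesn't exist today, with an illustrative column name:

```yaml
version: 2

models:
  - name: utcs_base_unioned
    # hypothetical: freshness is only valid on sources today
    loaded_at_field: _fivetran_synced   # illustrative column
    freshness:
      warn_after: {count: 24, period: hour}
      error_after: {count: 48, period: hour}
```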

Describe alternatives you've considered

As above

Additional context

I know you just rationalised the name of the task in #3554. It'd be confusing to have non-sources handled under the source command, but idk what to do instead.

Who will this benefit?

  • End users of the metadata tile, who won't see a "Data freshness passed" message when their data is actually a week stale
  • Barr, who will have to put up with less of my moaning about freshness semantics
  • Developers using dynamic relations instead of true sources

Are you interested in contributing this feature?

Maybe?

@joellabes Thanks for the thorough writeup! I think there are two different issues at play here:

  1. Should it be possible to define dynamic source/model dependencies? Yes, absolutely. It's not a capability we've been able to prioritize, but we've discussed it at length: #1212, #3072. I don't want you to have to sever source-model lineage just to have a model that unions together a bunch of source tables. FWIW, I don't think we'll be able to support the introspective, query-based approach (`get_relations_by_pattern`) with intact lineage, but certainly an approach that lets you say: for all `graph.sources` matching these criteria, source 'em (see the sketch after this list).

  2. Assuming it's possible for models to correctly establish their source lineage, should models have their own concept of "freshness"? I'm less sure about this one!
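To make the first point concrete, here's roughly the shape that's possible with today's Jinja (the source name is illustrative, and it comes with a big caveat, noted inline):

```sql
-- rough sketch of "source 'em": iterate the graph, not the database
-- (assumes an illustrative source named 'usage' whose tables share a prefix).
-- caveat: graph is only fully populated at execution time, so this still
-- wouldn't register source-model lineage at parse time; the empty-union
-- rendering at parse time is hand-waved here too.
{% set monthly_sources = [] %}
{% if execute %}
    {% for node in graph.sources.values() %}
        {% if node.source_name == 'usage' and node.name.startswith('usage_') %}
            {% do monthly_sources.append(source(node.source_name, node.name)) %}
        {% endif %}
    {% endfor %}
{% endif %}

{{ dbt_utils.union_relations(relations=monthly_sources) }}
```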

I think model "freshness" needs to be a concept synthesized out of two discrete things:

  • How recently raw data was loaded into upstream sources
  • How recently the model was last run, if it's a table (or even more precisely: how recently all tables upstream of it were built)

Put another way, the latency of `my_model` is `greatest(source_staleness, model_staleness)` taken across all resources selected by `+my_model`. If you could reliably define and know that latency, and determine its cause, you could:

  • send a notification, if the staleness comes from sources
  • trigger models to run, if the staleness comes from models (see: the second half of Benn's tweet)

This feels conceptually similar to (if also the inverse of) the `source:` pseudo-selector discussed in #2465 (comment). In that version of the world, regularly running jobs could just select resources that are downstream of sources with net-fresh data.

Some databases offer this metadata natively (e.g. Snowflake offers `last_altered`). dbt definitely offers this metadata, through the combination of `sources.json` and `run_results.json`. I think @barryaron may well have a role to play here after all ;)

> for all `graph.sources` matching these criteria, source 'em.

I'm mostly on board with this. The downside is that right now every new table is automatically incorporated without any extra bookkeeping. Our prod DB is like this:
[gif: Wallace and Gromit laying train track just ahead of the moving train]
A process automatically makes next month's table a month in advance - we're not bulk generating the whole set at once. If I defined the next 3 years of sources in dbt, 35 of them wouldn't exist and I'd get an error if I tried to select from them.

I could keep the introspective query and do something like *for all `graph.sources` where `source.name` is in `relations_that_actually_exist`* to avoid referencing a table that doesn't exist yet.
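For example (same illustrative names and the same parse-time caveats as the sketch above):

```sql
-- sketch: only union the sources whose tables actually exist yet
{% set existing_sources = [] %}
{% if execute %}
    {% for node in graph.sources.values() %}
        {% if node.source_name == 'usage' and node.name.startswith('usage_') %}
            {% set rel = adapter.get_relation(
                database=node.database,
                schema=node.schema,
                identifier=node.identifier
            ) %}
            {% if rel is not none %}
                {% do existing_sources.append(rel) %}
            {% endif %}
        {% endif %}
    {% endfor %}
{% endif %}

{{ dbt_utils.union_relations(relations=existing_sources) }}
```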

> trigger models to run, if the staleness comes from models (see: the second half of Benn's tweet)

I would love this. However, I don't think it would handle the case where the model has been rebuilt successfully but there's nothing new in it, because the September 2021 table isn't available to dbt (either Fivetran hasn't synced it, or the `dbt_user` account doesn't have permission to access it). And again, I wouldn't be able to put freshness tests on all of the pre-defined sources, because they don't exist yet: from my experimentation, you're allowed to have non-existent tables defined as sources as long as you don't try to use them.

Is that fair? Or have I missed something?