dbt source snapshot-freshness throws an error with empty tables.

Question

dbt source snapshot-freshness throws an error with empty tables.

Closed this issue 4 years ago · 4 comments

Describe the feature

Using a freshness check in a source yml such as the following:

      freshness:
        error_after: {count: 2, period: hour}

      loaded_at_field: updated_ts

will return "ERROR STALE" if the source table is empty. This prevents being able to proactively define sources that are currently empty without causing the entire dbt source snapshot-freshness command to fail until rows start to appear in the source table.

Being able to define a condition on the freshness check, such as checking that the row count is greater than 0, would allow the freshness check to be set up for empty tables that are expected to be populated in the future. Alternatively, a built-in WARN or other status if the table is empty could allow the overall freshness check to pass.

Describe alternatives you've considered

Current workaround is to leave the freshness check out until rows appear, then later on goign back and implementing the check once rows start to appear.

Additional context

Not database specific.

Who will this benefit?

Data engineers who are building up a data warehouse which is in the process of setting up data ingestion.

Answer 1 · 2020-05-11T19:29:23.000Z

@Swibles You raise an interesting point!

dbt source snapshot-freshness will return an error when there has not been data loaded within a certain window of time. If there have been 0 rows loaded within the defined window, because there have been 0 rows loaded since time immemorial, I would expect the same error.

This prevents being able to proactively define sources that are currently empty without causing the entire dbt source snapshot-freshness command to fail until rows start to appear in the source table.

Could you say more about your desired use case here? I totally buy why you want to define sources proactively; what I'm less sold on is that you would want snapshot-freshness to pass for a source table that contains no data.

Alternatively, a built-in WARN or other status if the table is empty could allow the overall freshness check to pass.

You could accomplish this today by including only warn_after (and excluding error_after) on sources you're defining proactively.

Another option: You could write a custom schema test, e.g. test_not_empty, that checks to see whether count(*) > 0 in a table.

Answer 2 · 2020-05-12T17:52:35.000Z

@jtcohen6 Thanks for your response.

When we run periodic freshness checks, we want an ERROR or FAIL if there is something wrong so that we can react to it. I assert that there is nothing inherently wrong with an empty table, but as soon as rows start appearing in the table and stop being ingested on a regular basis, then at that point we would want to see an ERROR or FAIL.

Without splitting up the freshness check into multiple commands (or even multiple jobs) we see the freshness checks fail over and over and would need to manually drill into the job on dbt cloud each time to ensure it is only the empty tables that are failing.

The warn_after suggestion would definitely help with the failures, but as soon as table starts seeing rows then we would need to go back to the source definition and submit another PR just to bring it back to an error. I would want to be able to set up the freshness check in the source definition once, and have dbt handle the tables being empty vs not empty without needing additional work/PRs.

Thank you for the custom schema test option as well, however I understand that this would not affect the dbt source snapshot-freshness command's status and would still require adjusting the freshness checks once rows start showing up.

Answer 3 · 2020-05-12T19:31:39.000Z

Thanks for the additional context. You're right that both the warn_after and custom schema test options would require later intervention, once the data has started populating. My thinking was that they'd be a good way to keep tabs on when it switches from empty to populated, so that you could know when to intervene.

It sounds like what you're after is different from what may make sense in the general case. (I'm willing to be proven wrong!) In the meantime, if you want, you may be able to accomplish the desired behavior in your own project by overriding the collect_freshness macro:

https://github.com/fishtown-analytics/dbt/blob/9d0eab630511723cd0bc328f6f11d3ffe6c8f879/core/dbt/include/global_project/macros/adapters/common.sql#L287-L303

You could rewrite the core of that query to look something like:

select
    case when count(*) = 0 then {{ current_timestamp() }} else max({{ loaded_at_field }}) end as max_loaded_at,
    {{ current_timestamp() }} as snapshotted_at
from {{ source }}

That query will PASS if the table is empty. You could make it WARN instead by adding an offset to the first {{ current_timestamp() }}, based on the value of your warn_after threshold.

Answer 4 · 2020-05-13T17:03:56.000Z

Thanks for the suggestion @jtcohen6 , much appreciated. Closing this issue.