dbt-labs/dbt_metrics

[Bug] `start_date` and `end_date` don't limit partition scanning in BigQuery

Closed this issue · 3 comments

Is this a new bug in dbt_metrics?

  • I believe this is a new bug in dbt_metrics
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

Using start_date and end_date in metrics calls where the underlying model is a BigQuery date-partitioned table does not limit data scanning.

I believe this is because of the way the start_date and end_date fields are rendered in the WHERE clause of the metric SQL: as a generic `cast(<timestamp_col> as date)`. BigQuery's query planner apparently cannot use a generic CAST on the partitioning column to prune partitions, whereas its built-in DATE() function can. Using `DATE(<timestamp_col>)` instead does limit partition scanning.
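For illustration, here is roughly what the rendered filter looks like today versus with the proposed change (the column name and date literal are made up for the example):

  -- current rendering: generic cast, scans all partitions
  where cast(created_at as date) >= cast('2022-01-01' as date)

  -- proposed rendering: BigQuery's DATE() function, prunes partitions
  where date(created_at) >= date('2022-01-01')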

Expected Behavior

Data scanning should be limited to the relevant partitions when start_date and/or end_date are specified.

Steps To Reproduce

  • Create a metric based on a BigQuery date-partitioned model
  • Call the metric specifying start_date and/or end_date
  • Observe that the bytes-scanned estimate shows a full scan of the column(s) used in the metric, rather than only the partitions in the date range
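One way to check bytes scanned without actually running the query is a dry run via the bq CLI (the dataset, table, and column names here are illustrative):

  # Estimate bytes processed for the generic-cast filter (no pruning expected)
  bq query --use_legacy_sql=false --dry_run \
    "select count(*) from my_dataset.my_partitioned_table
     where cast(created_at as date) >= cast('2022-01-01' as date)"

  # Same query with DATE() -- the estimate should drop to the matching partitions
  bq query --use_legacy_sql=false --dry_run \
    "select count(*) from my_dataset.my_partitioned_table
     where date(created_at) >= date('2022-01-01')"

The dry run prints the number of bytes the query would process, so the two estimates can be compared directly.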

Relevant log output

No response

Environment

- dbt-adapter & version: dbt-bigquery==1.2.0
- dbt_metrics version: `master` (pulled September 12)

Which database adapter are you using with dbt?

bigquery

Additional Context

Relevant Slack thread: https://getdbt.slack.com/archives/C02CCBBBR1D/p1663019049038639

@Jsturgeo would you be willing to test out the differences in table scans between the two queries in the BigQuery UI (and send the scanned-data screenshots afterwards)? Before we implement it, I want to make sure this fix is the right solution to move forward with.

@Jsturgeo following up on this one: is the table scan comparison something you can provide? To get this over the line for the 0.4.0 release we'll need to slot it into our next sprint, which starts next week.

Copying info over from some chats in the dbt slack.


I'm not surprised that there wasn't anything in the docs -- this is largely BQ tribal knowledge. I would recommend trying it for yourself if you have access to a partitioned BQ table (I looked through the BQ open datasets for an example, but none of the tables are partitioned, unfortunately).

If you use the following clause to attempt to limit partition scanning, the bytes-scanned estimate (and, indeed, the actual number of bytes scanned) will be as though you are scanning all partitions:

cast(<timestamp_col> as date) >= cast('<start_date>' as date)
and cast(<timestamp_col> as date) <= cast('<end_date>' as date)  

but if you use something like the following, partition scanning will be reduced:

date(<timestamp_col>) >= date('<start_date>')
and date(<timestamp_col>) <= date('<end_date>')  

I played with this extensively a couple of years ago, and the general rule of thumb seems to be that partition scanning is reduced only when BQ can evaluate the date predicate as a constant before query execution. This means that the following examples won't reduce partition scanning:

  • TIMESTAMP('<timestamp_string>', timezone) where timezone is a column
  • Casting as date (as above)
  • Using a timestamp column to try to limit partition scanning

But the following examples will:

  • TIMESTAMP('<timestamp_string>', <timezone>) where timezone is a fixed string
  • Using the BQ DATE operator
  • Using EXECUTE IMMEDIATE to fetch timestamps to then use as partitioning
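As a sketch of that last option, the timestamp can be fetched into a scripting variable first, so the filter BQ sees at execution time is a constant (the table and column names are illustrative, and this relies on the rule of thumb above holding for parameterized filters):

  declare start_dt date default (
    select min(order_date) from my_dataset.orders_summary
  );

  -- start_dt is a constant by the time the inner query runs,
  -- so the date() filter should be able to prune partitions
  execute immediate """
    select count(*) from my_dataset.partitioned_events
    where date(event_ts) >= ?
  """ using start_dt;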

Unfortunately I don't have access to a BigQuery partitioned table at the moment, and this isn't high-priority enough to requisition that access. In the meantime, if anyone running into this same issue wants to tackle the discovery, I'd love to get confirmation of the above so that we have the validation to make this change.