dbt-labs/dbt_metrics

[SEMANTIC-129] [Feature] dbt_metrics should be able to provide non-time aggregations

Closed · 1 comment

Is this your first time submitting a feature request?

  • I have read the expectations for open source contributors
  • I have searched the existing issues, and I could not find an existing issue for this feature
  • I am requesting a straightforward extension of existing dbt functionality, rather than a Big Idea better suited to a discussion

Describe the feature

Right now, metrics inside dbt-core require the timestamp property. We use it in our dbt_metrics query generation as the field that we join to the calendar table in order to date-spine the returned dataset. But this issue in dbt-core proposes a future where that isn't required!

The State of the World Today

Today, our metrics macros take the following inputs:

| Input | Required | Description |
| --- | --- | --- |
| `metric_list` | yes | A list of the metrics that the user wants to query. Accepts either a single metric object or a list of them, e.g. `metric('metric_name')`. |
| `grain` | yes | The time grain that the metric will be aggregated at. Relies on the fields in the default/custom calendar provided and the list of `time_grains` defined on the metric. |
| `dimensions` | no | A list of dimensions that the metric will be aggregated at. Relies on the `dimensions` property of the metric; we validate that the inputs match what has been provided in the metric definition. |
| `secondary_calculations` | no | A list of macro inputs that create new outputs based on secondary calculations such as rolling or period-over-period. This functionality requires that our data is date-spined with 0 values for dates where there is no data. |
| `start_date` | no | A string date; ensures that the data being queried is only after this date. |
| `end_date` | no | A string date; ensures that the data being queried is only before this date. |
| `where` | no | A string `where` clause that can be added to any metric query. It is applied in the final CTE, which means it can only reference the dimensions provided or the metric/date values. |
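For reference, a typical invocation combining these inputs might look like the sketch below. The metric, dimension, and model names are illustrative, and the exact macro signature should be checked against the package's README for your installed version:

```sql
-- Hypothetical query using the inputs described above; metric and
-- dimension names are examples, not objects from a real project.
select *
from {{ metrics.calculate(
    metric('total_revenue'),
    grain='month',
    dimensions=['customer_segment'],
    secondary_calculations=[
        metrics.rolling(aggregate='sum', interval=3)
    ],
    start_date='2022-01-01',
    end_date='2022-12-31',
    where="customer_segment != 'internal'"
) }}
```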

This produces metric queries based on the inputs. But let’s walk through the components of the query generation:

Metric Generation

  • Calendar Table: This CTE pulls from either the default calendar model that is included as part of the metrics package OR the custom calendar that the user has defined as a variable in their project. We look for whatever grain the user has selected as the field `date_{{grain}}` in this model.
  • Base Query: This subquery pulls from the model the metric depends upon and selects all of the needed dimensions, time fields, and fields to aggregate. It also joins the calendar table based on the lowest acceptable grain of time we support (day). This is where we apply filters and start/end dates.
  • Aggregation Query: This CTE operates on the Base Query and aggregates the metric value based upon the dimensions/grain provided.
  • Distinct Dims: This CTE selects a distinct list of all dimensions from the aggregation query. This constructs a dataset of all possible dimension combinations.
  • Date Spine: This CTE combines the Distinct Dims with the calendar table to create a dataset where each combination of dims has a representation for each grain that has been selected.
  • Date Spine Aggregation: This CTE joins the Aggregation Query with the Date Spine on all the dimensions. Rows where there isn’t any value for the metric on that date/dim combination are coalesced to 0 by default.
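The components above can be sketched as a single query skeleton. This is a simplified, hypothetical rendering — table, column, and CTE names are invented for illustration, and the real generated SQL contains dialect-specific details plus one base/aggregation pair per metric:

```sql
-- Simplified skeleton of the generated query for a monthly metric
-- with one dimension; names are illustrative only.
with calendar as (
    select date_day, date_month               -- the date_{{grain}} fields
    from {{ ref('dbt_metrics_default_calendar') }}
),
base_query as (
    select m.customer_segment, c.date_month, m.revenue
    from {{ ref('fct_orders') }} m
    left join calendar c
      on m.order_date = c.date_day            -- joined at the day grain
    where m.order_date >= '2022-01-01'        -- start/end dates and filters
),
aggregation as (
    select customer_segment, date_month, sum(revenue) as total_revenue
    from base_query
    group by 1, 2
),
distinct_dims as (
    select distinct customer_segment from aggregation
),
date_spine as (
    select d.customer_segment, c.date_month
    from distinct_dims d
    cross join (select distinct date_month from calendar) c
),
date_spine_aggregation as (
    select
        s.customer_segment,
        s.date_month,
        coalesce(a.total_revenue, 0) as total_revenue  -- fill gaps with 0
    from date_spine s
    left join aggregation a
      on a.customer_segment = s.customer_segment
     and a.date_month = s.date_month
)
select * from date_spine_aggregation
```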

Joins & Secondary Calcs

  • Join Metrics: If there are multiple metrics, this CTE joins them together. It also handles the nesting needed for derived metrics, correctly recognizing their dependency depth so they are built in the right order.
  • Secondary Calculations: This CTE applies all the secondary calculations that are provided in the macro. These are applied to all of the metrics that were input.
  • Final Select: This CTE selects from either the aggregation, join metrics, or secondary calculations CTE, depending on the inputs.
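To make the spine dependency concrete, here is a sketch of what a rolling secondary calculation layers on top of the spined result. The window frame only produces correct rolling values because every period is present (with 0-filled gaps) — which is exactly why secondary calculations require the date spine. CTE and column names are illustrative:

```sql
-- Hypothetical rolling-sum secondary calculation; the frame
-- "2 preceding" spans 3 periods only because the date spine
-- guarantees a row for every month, even months with no data.
secondary_calculations as (
    select
        date_month,
        customer_segment,
        total_revenue,
        sum(total_revenue) over (
            partition by customer_segment
            order by date_month
            rows between 2 preceding and current row
        ) as total_revenue_rolling_sum_3_month
    from date_spine_aggregation
)
```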


Proposal

One of the main drivers for implementing this change is the proposal to make timestamp an optional property of the metric definition. The astute among you might be asking how this aligns with the initial definition of metrics in the first issue:

> A metric is a timeseries aggregation over a table that supports zero or more dimensions.

Well, this is an evolution of our definition of a metric that unlocks new behaviors that we haven’t been able to support yet. These include things like static metric aggregations or aggregations at specific entity grains, such as customer_id.

Changes:

  • We will make grain optional so that simple queries returning single numbers don’t need to fall back on workaround behavior like an all_time grain
  • We will simplify our default behavior and skip date-spining unless secondary calculations are provided. This will alter the shape of our returned datasets
    • Perhaps we offer a config that triggers spining for non-secondary-calc queries
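Under this proposal, a grain-less query would collapse to a plain scan-and-aggregate with no calendar join and no spine. A hypothetical rendering (names illustrative) of a static aggregation at an entity grain:

```sql
-- Hypothetical simplified output once grain is optional and
-- spining is skipped: no calendar CTE, no date spine.
with base_query as (
    select customer_id, revenue
    from {{ ref('fct_orders') }}      -- filters/where still applied here
),
aggregation as (
    select customer_id, sum(revenue) as total_revenue
    from base_query
    group by 1
)
select * from aggregation
```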


Pros:

  • Simplifies the SQL generated for most metric queries, reducing both the size of the returned dataset and the complexity of the query
  • Supports static metric aggregations
  • Better supports metric aggregations at specific entity grains, such as customer_id

Cons:

  • A small chance that some BI tools won’t display the missing values gracefully, leading to gaps in charts. Most BI tools worth their salt should be able to handle this, though.
  • More serious branching logic in our query generation, which makes it a bit trickier to develop for because there would be two paths

Are you interested in contributing this feature?

Absolutely!

All right, I'm heading out on vacation for a few weeks so it is unlikely this ticket makes it over the line before the end of the year. That said, I've got a working branch in making-timestamp-optional that is running off of an edited version of dbt-core.

The biggest takeaway after a day of messing around with this is that my dislike of SQL commas is only increasing. The dynamic nature of the query generation means that we have to account for every possible combination of values when deciding where the first/last commas go.

So I believe the fix here is going to be moving the metric value to the first column in every CTE except the last. Previously we'd centered all CTEs around the time grain, but with that gone we need a new statement starter. So the metric value(s) will take that position, and we'll reorder the columns in the final CTE to match the more common consumption pattern of date/dims/metric.
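A quick sketch of the idea, with illustrative names. Because the aggregated metric value is the one column that is always present, leading with it means every optional date/dimension column can be emitted uniformly as `, column`, with no logic to decide which column gets the comma-free first slot:

```sql
-- Hypothetical column ordering under the proposed fix.
aggregation as (
    select
        sum(revenue) as total_revenue   -- always present: safe statement starter
        , date_month                    -- emitted only when a grain is requested
        , customer_segment              -- emitted only when dimensions are requested
    from base_query
    group by 2, 3
),
final as (
    -- reorder only in the last CTE, to the common date/dims/metric shape
    select date_month, customer_segment, total_revenue
    from aggregation
)
select * from final
```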