This add-on package enhances dbt
by providing macros which programmatically select columns
based on their column names. It is inspired by the across()
function
and the select helpers
in the R package dplyr
.
dplyr
(>= 1.0.0) has helpful semantics for selecting and applying transformations to variables based on their names.
For example, if one wishes to take the sum of all variables with name prefixes of N
and the mean of all variables with
name prefixes of IND
in the dataset mydata
, they may write:
summarize(
mydata,
across( starts_with('N'), sum),
across( starts_with('IND', mean)
)
This package enables us to similarly write dbt
data models with commands like:
{% set cols = dbtplyr.get_column_names( ref('mydata') ) %}
{% set cols_n = dbtplyr.starts_with('N', cols) %}
{% set cols_ind = dbtplyr.starts_with('IND', cols) %}
select
{{ dbtplyr.across(cols_n, "sum({{var}}) as {{var}}_tot") }},
{{ dbtplyr.across(cols_ind, "mean({{var}}) as {{var}}_avg") }}
from {{ ref('mydata') }}
which dbt
then compiles to standard SQL.
Alternatively, to protect against cases where no column names matched the pattern provided
(e.g. no variables start with n
so cols_n
is an empty list), one may instead internalize the final comma
so that it is only compiled to SQL when relevant by using the final_comma
parameter of across
.
{{ dbtplyr.across(cols_n, "sum({{var}}) as {{var}}_tot", final_comma = true) }}
Note that, slightly more dplyr
-like, you may also write:
select
{{ dbtplyr.across(dbtplyr.starts_with('N', ref('mydata')), "sum({{var}}) as {{var}}_tot") }},
{{ dbtplyr.across(dbtplyr.starts_with('IND', ref('mydata')), "mean({{var}}) as {{var}}_avg") }}
from {{ ref('mydata') }}
But, as each function call is a bit longer than the equivalent dplyr
code, I personally find the first form more readable.
The complete list of macros included are:
Functions to apply operation across columns
across(var_list, script_string, final_comma)
c_across(var_list, script_string)
Functions to evaluation condition across columns
if_any(var_list, script_string)
if_all(var_list, script_string)
Functions to subset columns by naming conventions
starts_with(string, relation or list)
ends_with(string, relation or list)
contains(string, relation or list)
not_contains(string, relation or list)
one_of(string_list, relation or list)
not_one_of(string_list, relation or list)
matches(string, relation)
everything(relation)
where(fn, relation)
wherefn
is the string name of a Column type-checker (e.g. "is_number")
Note that all of the select-helper functions that take a relation as an argument can optionally be passed a list of names instead.
Documentation for these functions is available on the package website and in the macros/macro.yml
file.