Automation scripts to accelerate dbt development, namely using code generation scripts to:

- Automate the dbt project setup process, designing the project to follow dbt's recommended project structure.
- Automate the creation of the dbt `_sources.yml` resource property file for a given data source.
- Generate (dbt) SQL files in bulk, either as `snapshot` tables or `incremental` loads.
- Use dbt `pre-commit` packages to ensure dbt standards and naming conventions are upheld (see `dbt-gloss` in `.pre-commit-config.yaml`).
Doing this ensures that new dbt projects implement best practices from the outset and removes much of the manual heavy lifting of dbt projects. See `example_generated_dbt_project` for an example dbt project generated using these scripts.
Note (re: steps 2 and 3): there's also a routine to add new data sources to an existing dbt project. For more details, see the section 'How to Add a New Data Source' below.
- Goal
- How to Run
- How to Add a New Data Source
- Dagrules
- Implementation Details
## Goal

The goal of these scripts is to accelerate dbt development through the use of code generation scripts. These scripts look to:

### Automate the dbt Project Setup Process
As well as automating the dbt project setup, these scripts design the project to follow dbt's recommended project structure and implement best practices. Expand the menu below for more details.
Expand for more details
- See `initialise_dbt_project` in the `Makefile`.
- The target `initialise_dbt_project` automates the dbt project setup process by:
  - Populating the `dbt_project.yml` and `profiles.yml` files & verifying the connectivity.
  - Providing a template `packages.yml` to bundle the install of best-practice dbt packages, e.g.:
    - `dbt_utils`
    - `dbt_expectations`
    - `dbt-codegen`
    - `dbt-project-evaluator`
  - Including additional (generic) dbt source tests, e.g.:
    - `raw_table_existence`
    - `is_table_empty`
  - Including additional dbt macros, e.g.:
    - `limit_row_count` - a custom macro to limit row counts when in lower (e.g., dev) environments
    - `generate_schema_name` - a commonly revised dbt macro
    - `grant_select_on_schemas` - a dbt-recommended macro to grant access to all tables in a schema
  - Recreating the target dbt project structure recommended by dbt, as shown below:
Click to show target dbt project structure
```
${DBT_PROJECT_NAME}
├── analysis
├── data
├── docs
│   └── pull_request_template.md
├── macros
│   ├── _macros.yml
│   ├── generate_schema_name.sql
│   └── grant_select_on_schemas.sql
├── models
│   ├── intermediate
│   │   ├── _int_<entity>__<verb>.yml.j2  # just a placeholder
│   │   └── example_cte.sql.j2  # placeholder
│   ├── marts
│   │   ├── _models.yml.j2  # placeholder
│   │   └── dim_customer.sql.j2  # placeholder
│   ├── staging
│   │   └── ${DBT_PROJECT_NAME}
│   │       ├── ${DBT_PROJECT_NAME}__docs.md
│   │       ├── ${DBT_PROJECT_NAME}__models.yml
│   │       ├── ${DBT_PROJECT_NAME}__sources.yml
│   │       ├── base (TBC)
│   │       │   ├── base_${DBT_PROJECT_NAME}__customers.sql
│   │       │   └── base_${DBT_PROJECT_NAME}__deleted_customers.sql
│   │       └── ${DBT_PROJECT_NAME}__customer.sql
│   └── utilities
│       └── all_dates.sql
├── snapshots
│   └── ${DATA_SRC}
│       └── ${DATA_SRC_SRC_TABLE}_snapshot.sql
├── tests
│   └── generic
│       └── sources
│           ├── existence
│           │   └── raw_table_existence.sql
│           └── row_count
│               └── is_table_empty.sql
├── README.md
├── dbt_project.yml
└── packages.yml
```
### Automate the creation of the dbt `_sources.yml` resource property file for a given data source
Expand for more details
- See `gen_source_properties_file` in the `Makefile`.
- This step automates the creation of the dbt source properties file (i.e., `_sources.yml`) for each data source, using the python script `py/gen_dbt_src_properties.py`.
- A key prerequisite for this step is for the user to supply a data-dictionary-type input file, indicating (per table) at field level:
  - The field description
  - Flags indicating whether the following 'generic' dbt tests should be applied to the field:
    - Unique
    - Not null
    - Accepted values
    - Relationship constraints
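To make the shape of this step concrete, here is a minimal sketch of turning data-dictionary rows into a `_sources.yml` file. The row keys (`table`, `field`, `description`, `is_unique`, `not_null`) and the source name are illustrative assumptions, not the exact schema that `py/gen_dbt_src_properties.py` expects:

```python
# Hedged sketch: render a minimal dbt `_sources.yml` from data-dictionary
# rows. Row keys below are illustrative assumptions, not the real script's
# input schema.
from collections import defaultdict


def build_sources_yml(source_name, rows):
    """Return _sources.yml text, grouping data-dictionary rows by table."""
    tables = defaultdict(list)
    for row in rows:
        tables[row["table"]].append(row)

    lines = ["version: 2", "", "sources:", f"  - name: {source_name}", "    tables:"]
    for table, fields in tables.items():
        lines.append(f"      - name: {table}")
        lines.append("        columns:")
        for f in fields:
            lines.append(f"          - name: {f['field']}")
            lines.append(f"            description: \"{f['description']}\"")
            # Map the data-dictionary flags onto dbt's generic tests.
            tests = [t for t, flag in (("unique", f["is_unique"]),
                                       ("not_null", f["not_null"])) if flag]
            if tests:
                lines.append("            tests:")
                lines.extend(f"              - {t}" for t in tests)
    return "\n".join(lines) + "\n"


rows = [
    {"table": "customers", "field": "id", "description": "Primary key",
     "is_unique": True, "not_null": True},
    {"table": "customers", "field": "email", "description": "Contact email",
     "is_unique": False, "not_null": False},
]
yml = build_sources_yml("jaffle_shop", rows)
print(yml)
```

The real script also handles the 'accepted values' and 'relationship' flags, which take arguments rather than simple booleans, so its input format is necessarily richer than this sketch.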
### Generate (dbt) SQL files in bulk that use the `snapshot` and `incremental` patterns
Expand for more details
- See `gen_dbt_sql_objs` in the `Makefile`.
- This step automates the creation of (dbt) SQL files in bulk (either as `snapshot` or `incremental [load]` SQL files) using Jinja templates. It does this using the python script `py/gen_dbt_sql_objs.py`.
- As with step 2 ('Generate the dbt source properties file'), a key prerequisite for this step is for the user to supply a data-dictionary-type input file (this time at the data-source level), indicating per source table:
  - The primary key
  - The 'last_updated_field'
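A minimal sketch of the bulk-generation idea: render one snapshot file per source table from a shared template. The real scripts use Jinja; Python's stdlib `string.Template` stands in here so the sketch is self-contained, and the table metadata and source name are illustrative assumptions:

```python
# Hedged sketch of bulk snapshot-file generation. `string.Template` is a
# stdlib stand-in for the Jinja templates used by py/gen_dbt_sql_objs.py;
# table metadata below is illustrative.
from string import Template

SNAPSHOT_TEMPLATE = Template("""\
{% snapshot ${table}_snapshot %}
{{
    config(
      target_schema='snapshots',
      unique_key='${primary_key}',
      strategy='timestamp',
      updated_at='${last_updated_field}',
    )
}}
select * from {{ source('${data_src}', '${table}') }}
{% endsnapshot %}
""")

# Per-table metadata from the data dictionary: primary key + last-updated field.
tables = [
    {"table": "customers", "primary_key": "id", "last_updated_field": "updated_at"},
    {"table": "orders", "primary_key": "order_id", "last_updated_field": "modified_ts"},
]

# One generated SQL file per source table.
sql_files = {
    f"{meta['table']}_snapshot.sql": SNAPSHOT_TEMPLATE.substitute(data_src="jaffle_shop", **meta)
    for meta in tables
}
for name, sql in sql_files.items():
    print(f"-- {name}\n{sql}")
```

The `incremental` variant would follow the same pattern with a different template, using the 'last_updated_field' in the model's incremental filter.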
## How to Run

Before you do anything, ensure you install and set up `pre-commit` by running:

```shell
pip install -r requirements.txt
pre-commit install
```
- Ensure you provide values for each of the keys in `ip/config.yaml`.
  - For a description of each of the input args, see `ip/README.md`.
- Review the input args used for the data-dictionary field-mapping args:
  - If you used the template data dictionary file, no changes will be required.
  - For more details, see `ip/data_dic_field_mapping_prereqs.md`.
- Update the input parameters within `ip/config.yaml`.
- Upload an input data dictionary to the `ip/` folder and (if required) review the value of the `data dictionary` key within `ip/data_dic_field_mapping_config.yaml`.
- Install the prerequisite libraries by running `make deps`.
- Run `make install` to:
Expand for more details
- Set up a dbt project, designing the project to follow dbt's recommended project structure, and validate source DB connectivity.
- Generate a dbt resource properties file (`_sources.yml`) using data from an input data dictionary/metadata.
- Recreate the target dbt project structure recommended by dbt.
- Generate (dbt) SQL files in bulk, either as snapshot tables or incremental loads.
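The scaffolding part of this can be sketched as follows: recreate dbt's recommended directory layout under the project root. The directory list mirrors the target structure shown earlier; the real work is done by the Makefile targets, so this is just an illustration:

```python
# Hedged sketch of the project-scaffolding step: recreate the recommended
# dbt directory layout. The Makefile targets do this in practice; the
# directory list mirrors the target structure shown in this README.
import tempfile
from pathlib import Path

DBT_DIRS = [
    "analysis",
    "data",
    "docs",
    "macros",
    "models/intermediate",
    "models/marts",
    "models/staging",
    "models/utilities",
    "snapshots",
    "tests/generic/sources",
]


def scaffold_project(project_root: Path) -> None:
    """Create the recommended dbt directory tree under project_root."""
    for d in DBT_DIRS:
        (project_root / d).mkdir(parents=True, exist_ok=True)


# Demo against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "my_dbt_project"
    scaffold_project(root)
    created = sorted(p.relative_to(root).as_posix()
                     for p in root.rglob("*") if p.is_dir())
print(created)
```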
## How to Add a New Data Source

If you've previously run `make install` to generate the template dbt project, you can do the following to add a new data source to your dbt project (click the menu to expand):
Expand for more details
- Update the `data_src` parameter within `ip/config.yaml` (underneath `general_params`) to reflect the data source you want to add.
- Upload an input data dictionary to the `ip` folder and ensure it matches the value of the `data dictionary` key within `ip/data_dic_field_mapping_config.yaml`.
- Run `make add_data_source` to:
  - Generate a dbt resource properties file (`_sources.yml`) using data from the input data dictionary/metadata.
  - Generate (dbt) SQL files in bulk, either as snapshot tables or incremental loads.
  - Import both of these into the previously generated dbt project.
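The final 'import' step amounts to copying the freshly generated files into the existing project. A minimal sketch, assuming the generated files land in a staging folder named after the data source (the actual destination paths used by `make add_data_source` may differ):

```python
# Hedged sketch of importing generated files into an existing dbt project.
# The destination path (models/staging/<data_src>) is an assumption about
# where make add_data_source places them.
import shutil
import tempfile
from pathlib import Path


def import_generated_files(gen_dir: Path, project_dir: Path, data_src: str) -> list[Path]:
    """Copy generated _sources.yml and SQL files into the dbt project."""
    staging_dir = project_dir / "models" / "staging" / data_src
    staging_dir.mkdir(parents=True, exist_ok=True)
    imported = []
    for src in sorted(gen_dir.iterdir()):
        dest = staging_dir / src.name
        shutil.copy(src, dest)
        imported.append(dest)
    return imported


# Demo against throwaway directories.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    gen = tmp / "generated"
    gen.mkdir()
    (gen / "_sources.yml").write_text("version: 2\n")
    (gen / "stg_new_src__orders.sql").write_text("select 1\n")
    project = tmp / "my_dbt_project"
    imported = import_generated_files(gen, project, "new_src")
    imported_names = sorted(p.name for p in imported)
print(imported_names)
```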
## Dagrules

The python package `dagrules` is installed as part of the build script. To use it, run `dagrules --check`.

To see which checks are performed by dagrules, see `templates/dagrules.yml`.