DBT-Fundamentals
- Play with packages as a way to compose pipelines and to distribute/manage code outside of a monorepo
Directory structure: `./doom/models/staging/jaffle_shop/stg_jaffle_shop.yml`
vs. `./doom/models/staging/stg_jaffle_shop.yml`
...the naming convention is unclear to me: which parts matter, and what is referenced where? I THINK the `jaffle_shop` directory is used in, and only in, `dbt_project.yml`
for simpler configuration of a group of models; otherwise a model's name, and every reference to it, is always taken from the filename (see the sketch below)...
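If that reading is right, the directory name would only ever appear as a config key in `dbt_project.yml`, something like this (a sketch; `doom` is this repo's project name, the materialization choice is illustrative):

```yaml
# dbt_project.yml
models:
  doom:
    staging:
      jaffle_shop:            # applies to every model under models/staging/jaffle_shop/
        +materialized: view
```

Models themselves would still be referenced by filename, with no directory prefix, e.g. `{{ ref('stg_customers') }}` for a hypothetical `models/staging/jaffle_shop/stg_customers.sql`.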
- Make a virtual environment, then `pip install -r requirements.txt` to set up `dbt-core` and `dbt-sqlite`.
- `python init_db.py` initializes the sqlite database `db.db` in `./data`. This will create the `customers`, `orders`, and `stripe` tables in the sqlite database for dbt transformations to consume as sources.
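End to end, the setup looks roughly like this (a sketch; the `.venv` name and activation step are my assumptions, the rest comes straight from the steps above):

```sh
python -m venv .venv              # make a virtual environment (name is illustrative)
source .venv/bin/activate
pip install -r requirements.txt   # installs dbt-core and dbt-sqlite
python init_db.py                 # creates customers, orders, stripe tables in ./data
```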
You need to configure the sqlite connector in `~/.dbt/profiles.yml`. The block below nests under the profile name that `dbt_project.yml` points at via its `profile:` key, alongside a `target: dev` entry:
```yaml
outputs:
  dev:
    type: sqlite
    threads: 1
    database: "database"
    schema: "main"
    schemas_and_paths:
      main: "/Users/npayne81/work/dbt-fundamentals/data/main.db"
    schema_directory: "/Users/npayne81/work/dbt-fundamentals/data"
```
`doom` is the dbt project directory, so `cd doom` and `dbt build`.
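Before building, `dbt debug` is a quick sanity check on the profile and connection; this is a standard dbt command, not from my original notes:

```sh
cd doom
dbt debug   # validates profiles.yml, dbt_project.yml, and the database connection
```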
Pro-tip: `pip install visidata`, then use `vd` to explore `data/data.db` at the terminal!
To run tests on selected models: `dbt test --select <model_1> <model_2> ...`
dbt ships with four built-in generic tests: `unique`, `not_null`, `accepted_values`, and `relationships`.
- `unique` tests that every value in a column is unique
- `not_null` tests that every value in a column is not null
- `accepted_values` tests that every value in a column is equal to a value in a provided list
- `relationships` tests that every value in a column exists in a column in another model (see: referential integrity)
Tests can be run against your current project using a range of commands:
- `dbt test` runs all tests in the dbt project
- `dbt test --select test_type:generic`
- `dbt test --select test_type:singular`
- `dbt test --select one_specific_model`
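These generic tests are declared in a model's `.yml` file. A sketch of what that looks like (the model and column names are illustrative, borrowed from the standard jaffle_shop example rather than taken from this repo):

```yaml
# models/staging/jaffle_shop/stg_jaffle_shop.yml (illustrative contents)
version: 2

models:
  - name: stg_customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
  - name: stg_orders
    columns:
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']
      - name: customer_id
        tests:
          - relationships:
              to: ref('stg_customers')
              field: customer_id
```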
`dbt run --select <model>` will run just that model. There isn't a separate API like Kedro has for `from-nodes` or `to-nodes`, but to run a model and everything downstream of it through the `ref` function you would do `dbt run --select <model>+` (the `+` graph operator; prefixed instead, as in `+<model>`, it selects upstream dependencies).
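The DAG those selectors walk is built entirely from `ref` calls. A minimal sketch of a downstream model (model names are illustrative):

```sql
-- models/customers.sql (illustrative)
-- ref() records stg_customers as an upstream dependency, so
-- `dbt run --select stg_customers+` rebuilds this model too.
select
    customer_id,
    first_name,
    last_name
from {{ ref('stg_customers') }}
```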
In working on this project, we established some conventions for naming our models.
- Sources (`src`) refer to the raw table data that have been built in the warehouse through a loading process. (We will cover configuring Sources in the Sources module.)
- Staging (`stg`) refers to models that are built directly on top of sources. These have a one-to-one relationship with source tables and are used for very light transformations that shape the data into what you want it to be, cleaning and standardizing it before transforming data downstream. Note: these are typically materialized as views.
- Intermediate (`int`) refers to any models that exist between final fact and dimension tables. These should be built on staging models rather than directly on sources, to leverage the data cleaning that was done in staging.
- Fact (`fct`) refers to any data that represents something that occurred or is occurring. Examples include sessions, transactions, orders, stories, votes. These are typically skinny, long tables.
- Dimension (`dim`) refers to data that represents a person, place, or thing. Examples include customers, products, candidates, buildings, employees.

Note: the Fact and Dimension convention is based on previous normalized modeling techniques.
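As a sketch of the "very light transformation" a `stg` model does (renames and a view materialization, nothing more; the source and column names are illustrative and assume a configured source, which the Sources module covers):

```sql
-- models/staging/jaffle_shop/stg_customers.sql (illustrative)
-- One-to-one with the raw source table: rename and lightly clean only.
{{ config(materialized='view') }}

select
    id as customer_id,
    first_name,
    last_name
from {{ source('jaffle_shop', 'customers') }}
```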