Add `--prod_db` flag option for --dbt diffs
mimoyer21 opened this issue · 8 comments
Is your feature request related to a problem? Please describe.
We have multiple prod databases that we frequently want to run data-diffs against. With the current options for setting the prod_database
value (the two options outlined here in the docs), it's cumbersome to switch between prod_databases.
Describe the solution you'd like
Add a flag that can be passed in for --dbt
diffs that allows us to define the prod_database
data-diff should compare against.
Currently the prod_database to diff against can be defined in the two ways outlined here in the docs. This would be a third way to define the prod_database to diff against.
I'm envisioning the command being something like: data-diff --dbt --prod_db <my_prod_db_name>
Describe alternatives you've considered
The other ways to switch which prod database we diff against are:
- update the prod_database variable in the dbt_project.yml file each time
- run
dbt ls -t prod
to produce a manifest.json file and then run data diff with the--state
flag (as outlined here)dbt ls -t prod
takes many minutes to run and makes for a frustrating development process for the data engineers on the team who are trying to use data-diff. Ultimately they end up skipping the data-diff if it's this slow
Additional context
Discussed with @vvkh in this Slack thread
Thanks for opening the issue. Makes sense to me. Is the production database the only variable that changes or do you have dynamic schema names based on the production manifest as well?
We happen to have a consistent schema naming pattern across production databases, so for us the prod_custom_schema
defined in dbt_project.yml
doesn't need to change, but I could see it being useful to have a flag for that too in case other data-diff users have different schema naming patterns for their prod databases.
Got it. I tried with environment variables but it didn't work as data-diff
reads the contents of the dbt_project.yml
vars directly vs. reading from the manifest(good thing for performance).
This feels like a reasonable feature request. I'm open to building this out. Something like the below.
data-diff --dbt --prod-database XYZ
/ data-diff --dbt -pd XYZ
data-diff --dbt --prod-schema XYZ
/ data-diff --dbt -ps XYZ
data-diff --dbt -pd XYZ -ps ABC
/data-diff --dbt --prod-database XYZ --prod-schema ABC
@dlawin I'm happy to own this if you good about the problem/approach!
Awesome, that output you laid out looks great to me.
Got it. I tried with environment variables but it didn't work as
data-diff
reads the contents of thedbt_project.yml
vars directly vs. reading from the manifest(good thing for performance).This feels like a reasonable feature request. I'm open to building this out. Something like the below.
data-diff --dbt --prod-database XYZ
/data-diff --dbt -pd XYZ
data-diff --dbt --prod-schema XYZ
/data-diff --dbt -ps XYZ
data-diff --dbt -pd XYZ -ps ABC
/data-diff --dbt --prod-database XYZ --prod-schema ABC
@dlawin I'm happy to own this if you good about the problem/approach!
Makes sense to me!
this is similar to the --where --include
flags in that they override what would otherwise be supplied via parsing:
Lines 202 to 205 in 4f8e5c9
@mimoyer21 We're in business. I'll update this and likely slack you in dbt public slack to test it out on my feature branch!