datafold/data-diff

Add `--prod_db` flag option for --dbt diffs

mimoyer21 opened this issue · 8 comments

Is your feature request related to a problem? Please describe.
We have multiple prod databases that we frequently want to run data-diffs against. With the current options for setting the prod_database value (the two options outlined here in the docs), it's cumbersome to switch between prod_databases.

Describe the solution you'd like
Add a flag that can be passed in for --dbt diffs that allows us to define the prod_database data-diff should compare against.
Currently the prod_database to diff against can be defined in the two ways outlined here in the docs. This would be a third way to define the prod_database to diff against.
I'm envisioning the command being something like: data-diff --dbt --prod_db <my_prod_db_name>

Describe alternatives you've considered
The other ways to switch which prod database we diff against are:

  • update the prod_database variable in the dbt_project.yml file each time
  • run dbt ls -t prod to produce a manifest.json file and then run data diff with the --state flag (as outlined here)
    • dbt ls -t prod takes many minutes to run and makes for a frustrating development process for the data engineers on the team who are trying to use data-diff. Ultimately they end up skipping the data-diff if it's this slow

Additional context
Discussed with @vvkh in this Slack thread

Thanks for opening the issue. Makes sense to me. Is the production database the only variable that changes or do you have dynamic schema names based on the production manifest as well?

We happen to have a consistent schema naming pattern across production databases, so for us the prod_custom_schema defined in dbt_project.yml doesn't need to change, but I could see it being useful to have a flag for that too in case other data-diff users have different schema naming patterns for their prod databases.

Got it. I tried with environment variables but it didn't work as data-diff reads the contents of the dbt_project.yml vars directly vs. reading from the manifest(good thing for performance).

This feels like a reasonable feature request. I'm open to building this out. Something like the below.

data-diff --dbt --prod-database XYZ/ data-diff --dbt -pd XYZ
data-diff --dbt --prod-schema XYZ/ data-diff --dbt -ps XYZ
data-diff --dbt -pd XYZ -ps ABC/data-diff --dbt --prod-database XYZ --prod-schema ABC

@dlawin I'm happy to own this if you good about the problem/approach!

Awesome, that output you laid out looks great to me.

dlawin commented

Got it. I tried with environment variables but it didn't work as data-diff reads the contents of the dbt_project.yml vars directly vs. reading from the manifest(good thing for performance).

This feels like a reasonable feature request. I'm open to building this out. Something like the below.

data-diff --dbt --prod-database XYZ/ data-diff --dbt -pd XYZ data-diff --dbt --prod-schema XYZ/ data-diff --dbt -ps XYZ data-diff --dbt -pd XYZ -ps ABC/data-diff --dbt --prod-database XYZ --prod-schema ABC

@dlawin I'm happy to own this if you good about the problem/approach!

Makes sense to me!

dlawin commented

this is similar to the --where --include flags in that they override what would otherwise be supplied via parsing:

data-diff/data_diff/dbt.py

Lines 202 to 205 in 4f8e5c9

# cli flags take precedence over any model level config
where_filter=where_flag or datadiff_model_config.where_filter,
include_columns=cli_columns or datadiff_model_config.include_columns,
exclude_columns=[] if cli_columns else datadiff_model_config.exclude_columns,

@mimoyer21 We're in business. I'll update this and likely slack you in dbt public slack to test it out on my feature branch!

Easier to implement than expected! Got something quick and dirty working. I'll give a fuller update next week!

image