Data-diff fails for Motherduck connections when using multiple databases as environments
rob-teeuwen opened this issue · 3 comments
Describe the bug
When connecting to Motherduck, query_table_schema()
throws an assertion error if the table path exists in more than 1 database.
I have a motherduck project with multiple databases, 1 per environment. I use these environments to materialize dbt models. In this example, I have a schema called dbt_marts
with a table called orders
. This exists in 3 databases, namely production
, development
, and local_test
. The schema in all 3 tables is exactly the same.
I came across this bug when experimenting with data-diff.
data-diff-config.toml
:
[database.motherduck]
driver = "duckdb"
filepath = "md:production?motherduck_token=${MOTHERDUCK_TOKEN}"
[run.test]
# Source 1 ("left")
1.database = "motherduck"
1.table = "dbt_marts.orders"
# Source 1 ("right")
2.database = "motherduck"
2.table = "dbt_marts.orders"
then I ran the command: data-diff --conf data-diff-config.toml --run test2
which throws:
Traceback (most recent call last):
File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/bin/data-diff", line 8, in <module>
sys.exit(main())
^^^^^^
File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/lib/python3.11/site-packages/click/core.py", line 1078, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/lib/python3.11/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/lib/python3.11/site-packages/data_diff/__main__.py", line 344, in main
return _data_diff(
^^^^^^^^^^^
File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/lib/python3.11/site-packages/data_diff/__main__.py", line 482, in _data_diff
schemas = list(differ._thread_map(_get_schema, safezip(dbs, table_paths)))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/robteeuwen/.pyenv/versions/3.11.0/lib/python3.11/concurrent/futures/_base.py", line 619, in result_iterator
yield _result_or_cancel(fs.pop())
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/robteeuwen/.pyenv/versions/3.11.0/lib/python3.11/concurrent/futures/_base.py", line 317, in _result_or_cancel
return fut.result(timeout)
^^^^^^^^^^^^^^^^^^^
File "/Users/robteeuwen/.pyenv/versions/3.11.0/lib/python3.11/concurrent/futures/_base.py", line 449, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/Users/robteeuwen/.pyenv/versions/3.11.0/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/Users/robteeuwen/.pyenv/versions/3.11.0/lib/python3.11/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/lib/python3.11/site-packages/data_diff/__main__.py", line 77, in _get_schema
return db.query_table_schema(table_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/robteeuwen/Library/Caches/pypoetry/virtualenvs/analytics-dagster-p7XfvFOJ-py3.11/lib/python3.11/site-packages/data_diff/databases/base.py", line 1042, in query_table_schema
assert len(d) == len(rows)
AssertionError
I traced it back to this line. It takes the output of the information_schema query over here and converts it to a dictionary. This line verifies that the number of rows in the dictionary (representing columns in the schema) is the same as the number of rows in the original result.
This fails in my case, because the information_schema returns the columns from all that meet WHERE table_name = '{name}' AND table_schema = '{schema}'
, regardless of the database in which they're in. During the dictionary conversion however, the set of rows is reduced to unique ones, so the assertion fails.
Not sure how this should be fixed. Not sure if it's expected that the information_schema returns columns from tables in other databases than the one specified in the connection parameters. One way to avoid this error is modifying the where clause as follows:
WHERE table_name = '{name}' AND table_schema = '{schema}' AND table_catalog = '[database-name]'
Describe the environment
Running MacOS with data-diff version v0.10.1 and duckdb 0.9.2
Thanks for submitting this issue with such a thorough explanation. I'll take this week to reproduce and fix. I'll follow up with any questions!
I replicated this bug with the example configs below. You made a great suggestion on the fix and that was all it took. Note that you'll need to provide the full table path: database.schema.table
vs. your original example: schema.table
to make this work correctly. This makes sense given your situation with different database names but consistent schema and table names.
Can you try installing from my public feature branch and verify it looks and feels the way you want before I wrote some tests?
# in your working git repo run these commands
git clone https://github.com/datafold/data-diff.git #clone source
cd data-diff # change directory to source
git checkout fix-motherduck-bug # checkout feature branche
cd .. # go back to your dbt project
pip install -e ../data-diff # install from source
data-diff --conf data-diff-config.toml --run test2 --debug # run your test command
[database.motherduck]
driver = "duckdb"
filepath = "md:datafold_demo?motherduck_token=${MOTHERDUCK_TOKEN}" # recommended for motherduck connection
[run.test]
# Source 1 ("left")
1.database = "motherduck"
1.table = "data_diff_demo.development.raw_orders"
# Source 1 ("right")
2.database = "motherduck"
2.table = "datafold_demo.development.raw_orders"
@rob-teeuwen this is fixed with the latest version of data-diff: 0.11.0
Can you diff again with the latest version?
pip install --upgrade data-diff