datafold/data-diff

Python API diff_tables() throws duckdb.BinderException

cmcnicoll opened this issue · 1 comment

Describe the bug
I get a duckdb.BinderException when I adapt the following example to DuckDB:

table1 = connect_to_table('postgresql:///', 'Rating', 'id')
list(diff_tables(table1, table1))
[]

Code:

import duckdb  # 0.10.2
from data_diff import connect_to_table, diff_tables  # 0.11.1

with duckdb.connect("test.duckdb") as con:
    con.sql("drop table if exists test_table")
    con.sql(
        "create table test_table as select * from read_csv('test/*.csv', header = true)"
    )
    con.sql("show all tables").show()
    con.table("test_table").show()

test_table = connect_to_table(
    "duckdb://test.duckdb", "test.main.test_table", "test_id"
)
print(test_table, "\n")

list(diff_tables(test_table, test_table))

Output:

$ py test_data_diff.py 
┌──────────┬─────────┬────────────┬───────────────────────┬───────────────────┬───────────┐
│ database │ schema  │    name    │     column_names      │   column_types    │ temporary │
│ varchar  │ varchar │  varchar   │       varchar[]       │     varchar[]     │  boolean  │
├──────────┼─────────┼────────────┼───────────────────────┼───────────────────┼───────────┤
│ test     │ main    │ test_table │ [test_id, test_value] │ [BIGINT, VARCHAR] │ false     │
└──────────┴─────────┴────────────┴───────────────────────┴───────────────────┴───────────┘

┌─────────┬────────────┐
│ test_id │ test_value │
│  int64  │  varchar   │
├─────────┼────────────┤
│       1 │ a          │
│       2 │ b          │
│       3 │ c          │
└─────────┴────────────┘

TableSegment(database=DuckDB(default_schema='main', _interactive=False, is_closed=False,
_dialect=Dialect(_prevent_overflow_when_concat=False), _args={'filepath': ''},
_conn=<duckdb.duckdb.DuckDBPyConnection object at 0x000001AD984E6C30>), table_path=('test', 'main', 'test_table'),
key_columns=('test_id',), update_column=None, extra_columns=(), ignored_columns=frozenset(), min_key=None, max_key=None,
min_update=None, max_update=None, where=None, case_sensitive=True, _schema=None)

Traceback (most recent call last):
  File "C:\code\duckdb\data-diff\test_data_diff.py", line 17, in <module>
    list(diff_tables(test_table, test_table))
  File "C:\code\duckdb\.venv\Lib\site-packages\data_diff\diff_tables.py", line 95, in __iter__
    for i in self.diff:
  File "C:\code\duckdb\.venv\Lib\site-packages\data_diff\diff_tables.py", line 266, in _diff_tables_wrapper
    raise error
  File "C:\code\duckdb\.venv\Lib\site-packages\data_diff\diff_tables.py", line 236, in _diff_tables_wrapper
    table1, table2 = self._threaded_call("with_schema", [table1, table2])
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\code\duckdb\.venv\Lib\site-packages\data_diff\diff_tables.py", line 51, in _threaded_call
    return list(self._thread_map(methodcaller(func), iterable))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\cmcni\AppData\Local\Programs\Python\Python312\Lib\concurrent\futures\_base.py", line 619, in result_iterator
    yield _result_or_cancel(fs.pop())
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\cmcni\AppData\Local\Programs\Python\Python312\Lib\concurrent\futures\_base.py", line 317, in _result_or_cancel
    return fut.result(timeout)
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\cmcni\AppData\Local\Programs\Python\Python312\Lib\concurrent\futures\_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\cmcni\AppData\Local\Programs\Python\Python312\Lib\concurrent\futures\_base.py", line 401, in __get_result
    raise self._exception
  File "C:\Users\cmcni\AppData\Local\Programs\Python\Python312\Lib\concurrent\futures\thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\code\duckdb\.venv\Lib\site-packages\data_diff\table_segment.py", line 153, in with_schema
    return self._with_raw_schema(self.database.query_table_schema(self.table_path))
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\code\duckdb\.venv\Lib\site-packages\data_diff\databases\base.py", line 1048, in query_table_schema
    rows = self.query(self.select_table_schema(path), list, log_message=path)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\code\duckdb\.venv\Lib\site-packages\data_diff\databases\base.py", line 996, in query
    res = self._query(sql_code)
          ^^^^^^^^^^^^^^^^^^^^^
  File "C:\code\duckdb\.venv\Lib\site-packages\data_diff\databases\duckdb.py", line 141, in _query
    return self._query_conn(self._conn, sql_code)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\code\duckdb\.venv\Lib\site-packages\data_diff\databases\base.py", line 1188, in _query_conn
    return apply_query(callback, sql_code)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\code\duckdb\.venv\Lib\site-packages\data_diff\databases\base.py", line 211, in apply_query
    return callback(sql_code)
           ^^^^^^^^^^^^^^^^^^
  File "C:\code\duckdb\.venv\Lib\site-packages\data_diff\databases\base.py", line 1173, in _query_cursor
    c.execute(sql_code)
duckdb.duckdb.BinderException: Binder Error: Catalog "test" does not exist!
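
One possible reading of the output above: the TableSegment repr shows `_args={'filepath': ''}`, so the file path from `"duckdb://test.duckdb"` may never have reached the DuckDB connection, which would be consistent with the catalog "test" not being found. The sketch below only illustrates how a generic URI parser splits a two-slash versus a three-slash URI. It uses Python's standard `urllib.parse`, not data-diff's own DSN parser (whose behavior may differ), and `split_db_uri` is a hypothetical helper written for this illustration.

```python
from urllib.parse import urlsplit


def split_db_uri(uri: str) -> dict:
    """Split a database URI into scheme, host, and path (illustration only).

    Hypothetical helper: data-diff parses connection strings with its own
    DSN parser, which is NOT what is used here.
    """
    parts = urlsplit(uri)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname or "",  # text between '//' and the next '/'
        "path": parts.path,            # everything after the host
    }


# Two slashes: 'test.duckdb' is read as the host, and the path is empty.
two_slash = split_db_uri("duckdb://test.duckdb")

# Three slashes: the host is empty and 'test.duckdb' lands in the path.
three_slash = split_db_uri("duckdb:///test.duckdb")

print(two_slash)
print(three_slash)
```

If data-diff's parser treats the URI the same way, the file name in the two-slash form would be taken as a host rather than a path. That is an assumption and has not been verified against data-diff 0.11.1.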

Describe the environment
Windows 11 Pro
Python 3.12.2
data_diff 0.11.1
duckdb 0.10.2

Hi @cmcnicoll,

Thank you for trying out data-diff and for taking the time to open this issue. We have made the hard decision to sunset the data-diff package and will not provide further development or support. Diffing functionality will continue to be available in Datafold Cloud. However, the DuckDB connector is not yet supported in the cloud, though it is on the roadmap.

Feel free to contact us at support@datafold.com if you have any questions.

-Gleb