datafold/data-diff

data-diff fails on tables join step with databricks connector

unidentified-user opened this issue · 3 comments

Description:
I am encountering an issue with the data-diff command failing on the tables join step when working with Databricks.
It appears that the key column data type is identified as Text(_notes=[]), which causes the process to fail.

Command:
data-diff -d --dbt --select <path_to_my_model>.

Output:
NotImplementedError: Cannot use a column of type Text(_notes=[]) as a key
You can see the full log in the attachments.

Environment:

  • MacOS Sonoma 14.0
  • python 3.9.18
  • data-diff 0.9.17
  • dbt-databricks 1.5.6
  • dbt-core 1.5.9
  • databricks-sql-connector 3.0.0

Attachments:

Hey @unidentified-user

Can you send a screenshot of examples of your primary key column that's being interpreted as text? I tested the below against my databricks sql warehouse and it interpreted it as String_VaryingAlphanum and worked just fine.

I had to use these dbt versions as the ones you listed didn't play nicely with Databricks SQL warehouses.
data-diff 0.9.17
databricks-sdk 0.9.0
databricks-sql-connector 2.9.3
dbt-core 1.7.4
dbt-databricks 1.7.3

test seed data

org_id,event_timestamp,activity
how,2022-03-29T09:04:05Z,feature A
are,2022-03-29T20:24:28Z,feature A
you,2022-03-31T15:13:42Z,feature A
doing,2022-03-31T23:01:37Z,feature A
today,2022-04-01T15:05:01Z,feature A
14:01:05 DEBUG    Running SQL (Databricks):                                                                                                                                                                                                                    base.py:959
                  SELECT * FROM (SELECT TRIM(`org_id`), TRIM(`activity`) FROM `hive_metastore`.`prod_sung`.`ref_demo_seed`) AS LIMITED_SELECT LIMIT 64                                                                                                                    
         INFO     [Databricks] Schema = {'org_id': String_VaryingAlphanum(_notes=[]), 'event_timestamp': Timestamp(_notes=[], precision=6, rounds=True), 'activity': String_VaryingAlphanum(_notes=[])}                                                       schema.py:12
14:01:06 DEBUG    Running SQL (Databricks):                                                                                                                                                                                                                    base.py:959
                  SELECT * FROM (SELECT TRIM(`org_id`), TRIM(`activity`) FROM `hive_metastore`.`dev_sung`.`ref_demo_seed`) AS LIMITED_SELECT LIMIT 64                                                                                                                     
         INFO     [Databricks] Schema = {'org_id': String_VaryingAlphanum(_notes=[]), 'event_timestamp': Timestamp(_notes=[], precision=6, rounds=True), 'activity': String_VaryingAlphanum(_notes=[])}      

Hey @sungchun12

Thank you for the response.
I eventually realized the issue was with my key not being alphanumeric, as it contains a colon sign (:).
See the example below:

image

Glad you figured it out! Closing this issue then :)