Databricks connection does not close properly when using --dbt
leoebfolsom opened this issue · 2 comments
Describe the bug
The Databricks connection stays open after the diff finishes and is only closed once the underlying socket has already died. The diff results are returned to the user successfully, but an error message is then logged when data-diff attempts to close the session (see the excerpt below).
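To illustrate the failure mode outside of data-diff, here is a hypothetical reproduction sketch (placeholder host, warehouse path, and token; the idle period and whether it triggers the exact same error are assumptions, not something confirmed in this issue): open a Databricks SQL connection, run a query, let the session sit idle until the server or network drops the socket, and only then call close().

# Hypothetical reproduction sketch -- placeholder credentials, not taken from the issue.
import time
from databricks import sql  # pip install databricks-sql-connector

conn = sql.connect(
    server_hostname="<redacted>.cloud.databricks.com",  # placeholder
    http_path="/sql/1.0/warehouses/<redacted>",          # placeholder
    access_token="<personal-access-token>",              # placeholder
)

cur = conn.cursor()
cur.execute("SELECT 1")
print(cur.fetchall())
cur.close()

# Assumption: leaving the session idle long enough lets the server/network side
# drop the socket while the client still considers the connection open.
time.sleep(30 * 60)

# If the socket is already gone by now, closing the session fails locally and the
# connector logs an error like the one shown at the end of the excerpt below.
conn.close()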
Version info:
(datafold-demo) ➜ datafold-demo git:(test-2) data-diff --version
v0.11.1
Excerpt of debug output:
Running with data-diff=0.11.1
08:28:30 INFO Parsing file dbt_project.yml dbt_parser.py:287
INFO Parsing file dbt_parser.py:280
/Users/leoebfolsom/git/leoebfolsom-databricks-demo/datafold-demo/target/manifest.json
INFO Parsing file target/run_results.json dbt_parser.py:253
INFO config: prod_database='dev' prod_schema='dbt_leoebfolsomdatabricksdemo_prod' dbt_parser.py:159
prod_custom_schema=None datasource_id=7498
INFO Parsing file /Users/leoebfolsom/.dbt/profiles.yml dbt_parser.py:294
DEBUG Found PKs via Tags [dim_orgs]: ['org_id'] dbt_parser.py:454
INFO [Databricks] Starting a threadpool, size=1. base.py:1237
DEBUG Running SQL (Databricks): base.py:982
SET TIME ZONE 'UTC'
08:28:34 DEBUG Running SQL (Databricks): ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs') base.py:980
SELECT * FROM (SELECT TRIM(`sub_plan`) FROM `dev`.`dbt_leoebfolsomdatabricksdemo_prod`.`dim_orgs`)
AS LIMITED_SELECT LIMIT 64
DEBUG Mixed Alphanum/Non-Alphanum values detected in column base.py:1153
dev.dbt_leoebfolsomdatabricksdemo_prod.dim_orgs.sub_plan. It cannot be used as a key.
INFO [Databricks] Schema = {'org_id': Integer(_notes=[], precision=0, python_type=<class 'int'>), schema.py:44
'created_at': Timestamp(_notes=[], precision=6, rounds=True), 'num_users': Integer(_notes=[],
precision=0, python_type=<class 'int'>), 'sub_created_at': Timestamp(_notes=[], precision=6,
rounds=True), 'sub_plan': Text(_notes=[], collation=None), 'sub_price': Integer(_notes=[],
precision=0, python_type=<class 'int'>)}
08:28:35 DEBUG Running SQL (Databricks): ('dev', 'dbt_leoebfolsomdatabricksdemo_dev', 'dim_orgs') base.py:980
SELECT * FROM (SELECT TRIM(`sub_plan`) FROM `dev`.`dbt_leoebfolsomdatabricksdemo_dev`.`dim_orgs`)
AS LIMITED_SELECT LIMIT 64
DEBUG Mixed Alphanum/Non-Alphanum values detected in column base.py:1153
dev.dbt_leoebfolsomdatabricksdemo_dev.dim_orgs.sub_plan. It cannot be used as a key.
INFO [Databricks] Schema = {'org_id': Integer(_notes=[], precision=0, python_type=<class 'int'>), schema.py:44
'created_at': Timestamp(_notes=[], precision=6, rounds=True), 'num_users': Integer(_notes=[],
precision=0, python_type=<class 'int'>), 'sub_created_at': Timestamp(_notes=[], precision=6,
rounds=True), 'sub_plan': Text(_notes=[], collation=None), 'sub_price': Integer(_notes=[],
precision=0, python_type=<class 'int'>)}
DEBUG Testing for duplicate keys: ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs') <> joindiff_tables.py:230
('dev', 'dbt_leoebfolsomdatabricksdemo_dev', 'dim_orgs')
DEBUG Running SQL (Databricks): base.py:982
SELECT cast(min(`org_id`) as string), cast(max(`org_id`) as string) FROM
`dev`.`dbt_leoebfolsomdatabricksdemo_prod`.`dim_orgs`
INFO Validating that the are no duplicate keys in columns: ['org_id'] for ('dev', joindiff_tables.py:243
'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs')
DEBUG Running SQL (Databricks): ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs') base.py:980
SELECT count(*) AS `total`, count(distinct coalesce(cast(`org_id` as string), '<null>')) AS
`total_distinct` FROM `dev`.`dbt_leoebfolsomdatabricksdemo_prod`.`dim_orgs`
DEBUG Running SQL (Databricks): base.py:982
SELECT cast(min(`org_id`) as string), cast(max(`org_id`) as string) FROM
`dev`.`dbt_leoebfolsomdatabricksdemo_dev`.`dim_orgs`
INFO Diffing segments at key-range: (100000001)..(100000158). size: table1 <= 157, table2 <= 157 diff_tables.py:318
INFO . Diffing segment 1/32, key-range: (100000001)..(100000005), size <= None joindiff_tables.py:183
DEBUG Collecting stats for table #1: ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs') joindiff_tables.py:270
DEBUG Querying for different rows: ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs') joindiff_tables.py:208
DEBUG Running SQL (Databricks): ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs') base.py:980
SELECT sum(`sub_price`) AS `sum_sub_price`, sum(`num_users`) AS `sum_num_users`, count(*) AS
`count` FROM `dev`.`dbt_leoebfolsomdatabricksdemo_prod`.`dim_orgs` WHERE (`org_id` >= 100000001)
AND (`org_id` < 100000005)
DEBUG Running SQL (Databricks): ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs') base.py:980
SELECT * FROM (SELECT (`tmp2`.`org_id` IS NULL) AS `is_exclusive_a`, (`tmp1`.`org_id` IS NULL) AS
`is_exclusive_b`, CASE WHEN `tmp1`.`org_id` is distinct from `tmp2`.`org_id` THEN 1 ELSE 0 END AS
`is_diff_org_id`, CASE WHEN `tmp1`.`sub_price` is distinct from `tmp2`.`sub_price` THEN 1 ELSE 0
END AS `is_diff_sub_price`, CASE WHEN `tmp1`.`sub_plan` is distinct from `tmp2`.`sub_plan` THEN 1
ELSE 0 END AS `is_diff_sub_plan`, CASE WHEN `tmp1`.`sub_created_at` is distinct from
`tmp2`.`sub_created_at` THEN 1 ELSE 0 END AS `is_diff_sub_created_at`, CASE WHEN
`tmp1`.`created_at` is distinct from `tmp2`.`created_at` THEN 1 ELSE 0 END AS `is_diff_created_at`,
CASE WHEN `tmp1`.`num_users` is distinct from `tmp2`.`num_users` THEN 1 ELSE 0 END AS
`is_diff_num_users`, cast(`tmp1`.`org_id` as string) AS `org_id_a`, cast(`tmp2`.`org_id` as string)
AS `org_id_b`, cast(`tmp1`.`sub_price` as string) AS `sub_price_a`, cast(`tmp2`.`sub_price` as
string) AS `sub_price_b`, cast(`tmp1`.`sub_plan` as string) AS `sub_plan_a`, cast(`tmp2`.`sub_plan`
as string) AS `sub_plan_b`,
date_format(timestamp_micros(cast(round(unix_micros(cast(`tmp1`.`sub_created_at` as timestamp)) /
1000000, 6) * 1000000 as bigint)), 'yyyy-MM-dd HH:mm:ss.SSSSSS') AS `sub_created_at_a`,
date_format(timestamp_micros(cast(round(unix_micros(cast(`tmp2`.`sub_created_at` as timestamp)) /
1000000, 6) * 1000000 as bigint)), 'yyyy-MM-dd HH:mm:ss.SSSSSS') AS `sub_created_at_b`,
date_format(timestamp_micros(cast(round(unix_micros(cast(`tmp1`.`created_at` as timestamp)) /
1000000, 6) * 1000000 as bigint)), 'yyyy-MM-dd HH:mm:ss.SSSSSS') AS `created_at_a`,
date_format(timestamp_micros(cast(round(unix_micros(cast(`tmp2`.`created_at` as timestamp)) /
1000000, 6) * 1000000 as bigint)), 'yyyy-MM-dd HH:mm:ss.SSSSSS') AS `created_at_b`,
cast(`tmp1`.`num_users` as string) AS `num_users_a`, cast(`tmp2`.`num_users` as string) AS
`num_users_b` FROM (SELECT * FROM `dev`.`dbt_leoebfolsomdatabricksdemo_prod`.`dim_orgs` WHERE
(`org_id` >= 100000001) AND (`org_id` < 100000005)) `tmp1` FULL OUTER JOIN (SELECT * FROM
`dev`.`dbt_leoebfolsomdatabricksdemo_dev`.`dim_orgs` WHERE (`org_id` >= 100000001) AND (`org_id` <
100000005)) `tmp2` ON (`tmp1`.`org_id` = `tmp2`.`org_id`)) tmp3 WHERE ((`is_diff_org_id` = 1) OR
(`is_diff_sub_price` = 1) OR (`is_diff_sub_plan` = 1) OR (`is_diff_sub_created_at` = 1) OR
(`is_diff_created_at` = 1) OR (`is_diff_num_users` = 1))
INFO Validating that the are no duplicate keys in columns: ['org_id'] for ('dev', joindiff_tables.py:243
'dbt_leoebfolsomdatabricksdemo_dev', 'dim_orgs')
DEBUG Running SQL (Databricks): ('dev', 'dbt_leoebfolsomdatabricksdemo_dev', 'dim_orgs') base.py:980
SELECT count(*) AS `total`, count(distinct coalesce(cast(`org_id` as string), '<null>')) AS
`total_distinct` FROM `dev`.`dbt_leoebfolsomdatabricksdemo_dev`.`dim_orgs`
08:28:36 DEBUG Done collecting stats for table #1: ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', joindiff_tables.py:306
'dim_orgs')
DEBUG Collecting stats for table #2: ('dev', 'dbt_leoebfolsomdatabricksdemo_dev', 'dim_orgs') joindiff_tables.py:270
DEBUG Running SQL (Databricks): ('dev', 'dbt_leoebfolsomdatabricksdemo_dev', 'dim_orgs') base.py:980
SELECT sum(`sub_price`) AS `sum_sub_price`, sum(`num_users`) AS `sum_num_users`, count(*) AS
`count` FROM `dev`.`dbt_leoebfolsomdatabricksdemo_dev`.`dim_orgs` WHERE (`org_id` >= 100000001) AND
(`org_id` < 100000005)
DEBUG Done collecting stats for table #2: ('dev', 'dbt_leoebfolsomdatabricksdemo_dev', joindiff_tables.py:306
'dim_orgs')
DEBUG Testing for null keys: ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs') <> joindiff_tables.py:252
('dev', 'dbt_leoebfolsomdatabricksdemo_dev', 'dim_orgs')
DEBUG Running SQL (Databricks): ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs') base.py:980
SELECT `org_id` FROM `dev`.`dbt_leoebfolsomdatabricksdemo_prod`.`dim_orgs` WHERE (`org_id` >=
100000001) AND (`org_id` < 100000005) AND (`org_id` IS NULL)
.... many more repetitive log lines like the above ...
DEBUG Counting differences per column: ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', joindiff_tables.py:346
'dim_orgs') <> ('dev', 'dbt_leoebfolsomdatabricksdemo_dev', 'dim_orgs')
DEBUG Running SQL (Databricks): ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs') <> ('dev', base.py:980
'dbt_leoebfolsomdatabricksdemo_dev', 'dim_orgs')
SELECT sum(`is_diff_org_id`), sum(`is_diff_sub_price`), sum(`is_diff_sub_plan`),
sum(`is_diff_sub_created_at`), sum(`is_diff_created_at`), sum(`is_diff_num_users`) FROM (SELECT
(`tmp2`.`org_id` IS NULL) AS `is_exclusive_a`, (`tmp1`.`org_id` IS NULL) AS `is_exclusive_b`, CASE
WHEN `tmp1`.`org_id` is distinct from `tmp2`.`org_id` THEN 1 ELSE 0 END AS `is_diff_org_id`, CASE
WHEN `tmp1`.`sub_price` is distinct from `tmp2`.`sub_price` THEN 1 ELSE 0 END AS
`is_diff_sub_price`, CASE WHEN `tmp1`.`sub_plan` is distinct from `tmp2`.`sub_plan` THEN 1 ELSE 0
END AS `is_diff_sub_plan`, CASE WHEN `tmp1`.`sub_created_at` is distinct from
`tmp2`.`sub_created_at` THEN 1 ELSE 0 END AS `is_diff_sub_created_at`, CASE WHEN
`tmp1`.`created_at` is distinct from `tmp2`.`created_at` THEN 1 ELSE 0 END AS `is_diff_created_at`,
CASE WHEN `tmp1`.`num_users` is distinct from `tmp2`.`num_users` THEN 1 ELSE 0 END AS
`is_diff_num_users`, cast(`tmp1`.`org_id` as string) AS `org_id_a`, cast(`tmp2`.`org_id` as string)
AS `org_id_b`, cast(`tmp1`.`sub_price` as string) AS `sub_price_a`, cast(`tmp2`.`sub_price` as
string) AS `sub_price_b`, cast(`tmp1`.`sub_plan` as string) AS `sub_plan_a`, cast(`tmp2`.`sub_plan`
as string) AS `sub_plan_b`,
date_format(timestamp_micros(cast(round(unix_micros(cast(`tmp1`.`sub_created_at` as timestamp)) /
1000000, 6) * 1000000 as bigint)), 'yyyy-MM-dd HH:mm:ss.SSSSSS') AS `sub_created_at_a`,
date_format(timestamp_micros(cast(round(unix_micros(cast(`tmp2`.`sub_created_at` as timestamp)) /
1000000, 6) * 1000000 as bigint)), 'yyyy-MM-dd HH:mm:ss.SSSSSS') AS `sub_created_at_b`,
date_format(timestamp_micros(cast(round(unix_micros(cast(`tmp1`.`created_at` as timestamp)) /
1000000, 6) * 1000000 as bigint)), 'yyyy-MM-dd HH:mm:ss.SSSSSS') AS `created_at_a`,
date_format(timestamp_micros(cast(round(unix_micros(cast(`tmp2`.`created_at` as timestamp)) /
1000000, 6) * 1000000 as bigint)), 'yyyy-MM-dd HH:mm:ss.SSSSSS') AS `created_at_b`,
cast(`tmp1`.`num_users` as string) AS `num_users_a`, cast(`tmp2`.`num_users` as string) AS
`num_users_b` FROM (SELECT * FROM `dev`.`dbt_leoebfolsomdatabricksdemo_prod`.`dim_orgs` WHERE
(`org_id` >= 100000125) AND (`org_id` < 100000158)) `tmp1` FULL OUTER JOIN (SELECT * FROM
`dev`.`dbt_leoebfolsomdatabricksdemo_dev`.`dim_orgs` WHERE (`org_id` >= 100000125) AND (`org_id` <
100000158)) `tmp2` ON (`tmp1`.`org_id` = `tmp2`.`org_id`)) tmp3 WHERE ((`is_diff_org_id` = 1) OR
(`is_diff_sub_price` = 1) OR (`is_diff_sub_plan` = 1) OR (`is_diff_sub_created_at` = 1) OR
(`is_diff_created_at` = 1) OR (`is_diff_num_users` = 1))
INFO Diffing complete: ('dev', 'dbt_leoebfolsomdatabricksdemo_prod', 'dim_orgs') <> ('dev', joindiff_tables.py:165
'dbt_leoebfolsomdatabricksdemo_dev', 'dim_orgs')
dev.dbt_leoebfolsomdatabricksdemo_prod.dim_orgs <> dev.dbt_leoebfolsomdatabricksdemo_dev.dim_orgs
Primary Keys: ['org_id']
rows       PROD    <>            DEV
---------  ------  ------------  -----------------
Total      157                   10 [-147]
Added              0
Removed            -147
Different          0
Unchanged          10
columns         # diff values
--------------  ---------------
created_at      0
num_users       0
sub_created_at  0
sub_plan        0
sub_price       0
08:29:38 ERROR Attempt to close session raised a local exception: client.py:277
HTTPSConnectionPool(host='<redacted>.cloud.databricks.com', port=443): Max retries
exceeded with url: /sql/1.0/warehouses/<redacted> (Caused by SSLError(SSLError(0,
'Underlying socket connection gone (_ssl.c:2407)')))
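If the session really is being held open after the diff completes, as described above, the usual caller-side pattern is to tie the connection's lifetime to the queries so that close() runs while the socket is still alive. A minimal sketch of that pattern, assuming databricks-sql-connector and contextlib.closing, and not data-diff's actual implementation:

# Sketch only: close the Databricks session as soon as the work is done,
# rather than leaving cleanup to a later point when the socket may be dead.
from contextlib import closing
from databricks import sql  # pip install databricks-sql-connector

def run_queries(server_hostname: str, http_path: str, access_token: str):
    # closing() guarantees conn.close() / cur.close() run as soon as the block
    # exits, immediately after the last query.
    with closing(sql.connect(
        server_hostname=server_hostname,
        http_path=http_path,
        access_token=access_token,
    )) as conn, closing(conn.cursor()) as cur:
        cur.execute("SELECT 1")  # stand-in for the diff queries in the log above
        return cur.fetchall()

The roughly one-minute gap between "Diffing complete" (08:28:36) and the close error (08:29:38) in the log above is what makes the cleanup path look like the likely culprit.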
Describe the environment
OS: MacBook Pro 16-inch, 2021; Apple M1 Max; 64 GB memory; macOS Ventura 13.4.1
data-diff version: 0.11.1
(I also found the same issue in 0.9.17)
This issue has been marked as stale because it has been open for 60 days with no activity. If you would like the issue to remain open, please comment on the issue and it will be added to the triage queue. Otherwise, it will be closed in 7 days.
Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment and it will be reopened for triage.