Counts from command line summary do not match up with row counts in materialized results
zigouras opened this issue · 1 comments
Describe the bug
The counts reported on the command line by the open source data-diff program are correct for rows exclusive to table a, exclusive to b and changed rows. However, the counts of the result rows in the materialized results table do not match what I would expect based on those counts. Only the count of rows exclusive to table B is correct. SQL outside of data-diff confirms the inaccuracy of the materialized results.
Make sure to include the following (minus sensitive information):
data-diff \
$VACANT_LOTS_DB \
public.li_complaints \
backup_.li_complaints \
-k service_request_id \
-c '%' \
-m 'backup_.li_complaints_diff' \
--stats
- The run output + error you're getting. (including tracestack)
247415 rows in table A
246058 rows in table B
1560 rows exclusive to table A (not present in B)
203 rows exclusive to table B (not present in A)
929 rows updated
244926 rows unchanged
1.01% difference score
Extra-Info:
diff_counts = {'service_request_id_a': 19, 'service_code_a': 15, 'service_name_a': 19, 'address_a': 19, 'subject_a': 19,
'y_a': 19, 'x_a': 19, 'status_a': 19, 'geometry_a': 19}
exclusive_count = 19
table1_count = 247415
table1_sum_x = -18593115.109934963
table1_sum_y = 9894894.846454354
table2_count = 246058
table2_sum_x = -18491140.8768517
table2_sum_y = 9840625.990917146
Count from SQL against the materialized table:
select is_exclusive_a, is_exclusive_b, count(*) from backup_.li_complaints_diff
group by is_exclusive_a,is_exclusive_b;
is_exclusive_a | is_exclusive_b | count
----------------+----------------+-------
f | f | 832
f | t | 203
t | f | 795
Raw counts from the source and target tables:
select count(*) from li_complaints p where not exists (select service_request_id from
backup_.li_complaints where service_request_id = p.service_request_id) ;
count
-------
1560
(1 row)
select count(*) from backup_.li_complaints b where not exists (select service_request_id from
li_complaints where service_request_id = b.service_request_id) ;
count
-------
203
(1 row)
- Run data-diff with the
-d
switch for extra debug information.
see attached file data-diff-debug.txt
data-diff-debug.txt
Describe the environment
data-diff 0.11.1
Python 3.11.4
os: Ubuntu Linux on WSL on Windows 10
PostgreSQL 14 on Ubuntu
I think I figured it out, the default --table-write-limit of 1000 was the problem. I increased it and the numbers look correct in the materialized table now.
This issue can be deleted.