datafold/data-diff

Counts from command line summary do not match up with row counts in materialized results

zigouras opened this issue · 1 comments

Describe the bug
The counts reported on the command line by the open source data-diff program are correct for rows exclusive to table a, exclusive to b and changed rows. However, the counts of the result rows in the materialized results table do not match what I would expect based on those counts. Only the count of rows exclusive to table B is correct. SQL outside of data-diff confirms the inaccuracy of the materialized results.

Make sure to include the following (minus sensitive information):

data-diff \
  $VACANT_LOTS_DB \
  public.li_complaints \
  backup_.li_complaints \
  -k service_request_id \
  -c '%' \
  -m 'backup_.li_complaints_diff' \
  --stats 
  • The run output + error you're getting. (including tracestack)
247415 rows in table A
246058 rows in table B
1560 rows exclusive to table A (not present in B)
203 rows exclusive to table B (not present in A)
929 rows updated
244926 rows unchanged
1.01% difference score

Extra-Info:
  diff_counts = {'service_request_id_a': 19, 'service_code_a': 15, 'service_name_a': 19, 'address_a': 19, 'subject_a': 19, 
'y_a': 19, 'x_a': 19, 'status_a': 19, 'geometry_a': 19}
  exclusive_count = 19
  table1_count = 247415
  table1_sum_x = -18593115.109934963
  table1_sum_y = 9894894.846454354
  table2_count = 246058
  table2_sum_x = -18491140.8768517
  table2_sum_y = 9840625.990917146

Count from SQL against the materialized table:

select is_exclusive_a, is_exclusive_b, count(*) from backup_.li_complaints_diff 
group by is_exclusive_a,is_exclusive_b;
 is_exclusive_a | is_exclusive_b | count 
----------------+----------------+-------
 f              | f              |   832
 f              | t              |   203
 t              | f              |   795

Raw counts from the source and target tables:

select count(*) from li_complaints p where not exists (select service_request_id from 
backup_.li_complaints where service_request_id = p.service_request_id) ;
 count 
-------
  1560
(1 row)

select count(*) from backup_.li_complaints b where not exists (select service_request_id from 
li_complaints where service_request_id = b.service_request_id) ;
 count 
-------
   203
(1 row)
  • Run data-diff with the -d switch for extra debug information.
    see attached file data-diff-debug.txt
    data-diff-debug.txt

Describe the environment

data-diff 0.11.1
Python 3.11.4
os: Ubuntu Linux on WSL on Windows 10
PostgreSQL 14 on Ubuntu

I think I figured it out, the default --table-write-limit of 1000 was the problem. I increased it and the numbers look correct in the materialized table now.

This issue can be deleted.