datafold/data-diff

data-diff fails to identify unique composite primary keys

gabrielleberanger opened this issue · 5 comments

Describe the bug

I am running data-diff on a model having a composite primary key, configured as such:

version: 2
​
models:
  - name: bse_metabase__permissions_group_membership
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns: ['user_id', 'group_id']

But the data-diff execution fails on the _test_duplicate_keys method (while there are actually no duplicates).

Suggested resolution

data-diff checks for duplicates using CONCAT(<field_1>, <field_2>), but CONCAT(<field_1>, '-', <field_2>) would be more appropriate.

Let's assume that we have the following rows:

  • Row 1: user_id = 1 x group_id = 11
  • Row 2: user_id = 11 x group_id = 1

In the above example:

  • CONCAT(<field_1>, <field_2>) returns 111 for both rows (hence the error raised)
  • CONCAT(<field_1>, '-', <field_2>) returns 1-11 and 11-1, which is more robust

@gabrielleberanger Do you still want to work on this pull request? I remember us talking about this in the dbt community slack!

@sungchun12 yes, I do!
I'm starting to work on it right now, so you should receive a PR from me by the end of this week.

Lovely, look forward to reviewing it when it's ready!

This issue has been marked as stale because it has been open for 60 days with no activity. If you would like the issue to remain open, please comment on the issue and it will be added to the triage queue. Otherwise, it will be closed in 7 days.

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment and it will be reopened for triage.