agardiner/csv-diff

The comparison is matching two rows on their keys that do not have the same values -- instead, it shows the matched keys in an array.

hammzj opened this issue · 7 comments

Hi,

there seems to be a very critical bug with how key fields are being processed. I will share these in a template I use for defect reporting.

Problem Statement
I am trying to compare large files using this tool and sometimes provide around 5 key fields. A lot of the matches are mapped from the left dataset to the right dataset correctly, but there are a lot of times where we are getting rows that are matching incorrectly on an ID field and therefore comparing incorrectly.

For example, if I were to look at the updates, there will be some rows where their key fields are showing up as an array of differences. Let's say here my two key fields are named PRIMARY_ID and PRIMARY_ID_TYP.

@results.updates #=> [ ..., REEO3551643~INTERNAL = CSVDiff::Algorithm::Diff]
@results.updates['REEO3551643~INTERNAL'].fields #=>{PRIMARY_ID : ['REEO3551643', 'REEO3651675'], price: ['2.50, '3.75'], ...}

So, right here, two rows from the left and right datasets are matching on keys that don't match. This is a huge error to me.

Expected Result:
Only rows that have the same compound key_fields should be matched and compared for "update" differences.

Actual Defect Result:
Rows are being matched that have a different value for a field in their key_fields list, producing many "updates" false positives.

I cannot share our files due to security issues but I will try to get you a code solution later when I can build out a script with a dataset that produces this error.

Another note I'd like to add: both of those PRIMARY_IDs are exclusive to their left and right datasets ("REEO3551643" only in left, "REEO3651675" only in right). They do not exist in both datasets.

Here is the gist that replicates this code. The datasets look slightly complex but here is what I expect:

Expected:
2 additions
2 deletes

Actual:
2 updates

Defect gist

Hi,

Thanks for the defect report and a test case for reproducing the bug. I will try to look at it in more detail in the next few days, but I have determined that the bug appears to be due to parent/child key splitting. If you change from using key_fields (which implicitly treats all but the last field in a compound key as the parent, and the last field as a child) to using child_fields, you should get the correct result.

csv_diff = CSVDiff.new(left.read, right.read,
                       field_names: field_names,
                       child_fields: key_fields,
                       ignore_fields: ignore_fields)
2.3.7
{"Delete"=>2, "Add"=>2}

UPDATES: See the PRIMARY_ID field. That is a key_field.

OK, I've confirmed the issue is definitely due to the splitting of keys into a parent part and a child part. This happens implicitly when you use key_fields instead of parent_fields/child_fields, which I must confess I've felt was the wrong decision for some time now, because it is certainly non-obvious.

csv_diff treats compound keys as having a parent part and a child part, since a compound key can be considered a kind of tree structure. For my personal use cases, I mainly work with trees, and being able to identify when a child moves to a new parent as an update (i.e. a move of the child under a new parent) rather than a delete and add was an important feature.

In the case of trees, my compound keys normally consist of just a parent id and a child id, and so I made the decision that if you only specify key_fields, that would be converted to parent_fields and child_fields automatically by considering the last field the child_id and all other fields as a compound parent key. This was done to make my life easier at the time, since I found myself repeatedly specifying parent_fields: [0], child_fields[1].

Over the years I've realized that was the wrong decision, as I've hit similar issues to this when working with non-tree data, but it was usually pretty obvious when I'd made this mistake, and it was a quick fix to switch from key_fields to child_fields. This bug report though has convinced me I need to finally fix this design mistake, and so I will be pushing an update shortly that changes the implicit treatment of key_fields from:

   parent_fields = key_fields[0...-1]
   child_fields = key_fields[-1]

to:

    parent_fields = []
    child_fields = key_fields

This effectively results in the entire compound key being treated as a single value, and so changes to any part of the key would be considered an add and delete.

Thanks again for sending the bug report and reproduction.

Adam,

Just for understanding -- with the new gem change, I shall not need to use parent_fields and child_fields, but I can continue to use only the key_fields? But, with the previous gem version, it is necessary to use the parent and child options instead? Is that correct?

Immensely I thank you for looking into this issue as quickly as you did. Secondly, may I please ask that the README gets updated to account for these changes and possibly host the explanation you gave me above? I will need to make sure I call this out to my team and the users so we can learn how to build these properly.

All of this work is highly appreciated by me and my team, and I will continue to contribute what I can to this gem.

@agardiner,
I see you've updated the version to 0.6. I will try this out and report back if we see any issues with the key_field change.

All in all, thank you very much for the quick turnaround and attention on this issue!

@agardiner Hey, I am getting back very late but I was waiting on another team to confirm that the change here worked for them and it did. I will close this issue now. Thanks!