agardiner/csv-diff

Question on comparing high-precision decimal values

hammzj opened this issue · 3 comments

Hello,
First, thank you for this gem; it is incredibly useful. My department works in a data-heavy area, and we need it to compare files containing very precise values.

Because of this, the diff still reports updates when values differ only at extremely high precision, for example a number that is off in the 17th decimal place. We are allowed to round our values, say to the 7th decimal place, but the gem reports differences that we are not concerned with.

So I was wondering: is there a way within this gem to declare a precision for certain columns, or to treat them as specific Ruby data types? Or will I need to extend this functionality within my own project?

Thank you.

Hi,

Really glad you find this tool useful. :-)

For your issue, you can use equality procs to do the diff comparison. When you create the CSVDiff object to do the compare, you can pass an :equality_procs key in the options hash. The value for this key should be another hash, whose keys match the names of the field(s) where high-precision values need to be compared, and whose values are lambdas or Procs that take the left and right values and return a boolean indicating whether the values are the same.

As an example, if you have a couple of files with the following layout:

Key Text1 Text2 Float1 Float2
... ... ... ... ...

You could do a 2 decimal place comparison of the values in Float1 and Float2 as follows:

# Create a lambda to compare two values at 2 decimal places precision.
# to_f is a defensive conversion in case the values arrive as Strings.
dp_comp = lambda{ |left, right| left.to_f.round(2) == right.to_f.round(2) }
opts = {
    # Name the fields in the file
    fields: [:key, :text1, :text2, :float1, :float2],
    # Specify the field that is the key
    key_field: :key,
    # Specify which fields should use non-default equality testing
    equality_procs: {
        float1: dp_comp, float2: dp_comp
    }
}

diff = CSVDiff.new(from, to, opts).diffs
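
The original question mentioned rounding to the 7th decimal place, so the precision may differ per column. A minimal sketch of a helper that builds a comparison lambda for any number of decimal places could look like this. `rounded_eq` is a hypothetical name, not part of csv-diff, and the `to_f` conversion assumes the values may arrive as Strings:

```ruby
# Hypothetical helper (not part of csv-diff): returns a lambda that treats
# two values as equal when they match after rounding to `places` decimals.
def rounded_eq(places)
  lambda do |left, right|
    # to_f accepts Strings, Integers and Floats (assumption: the values
    # may be passed in as Strings read from the CSV).
    left.to_f.round(places) == right.to_f.round(places)
  end
end

eq7 = rounded_eq(7)
eq2 = rounded_eq(2)

eq7.call("0.12345671", "0.12345674")  # differs only past the 7th decimal place
eq7.call("0.1234567", "0.1234569")    # differs at the 7th decimal place
eq2.call(1.001, "1.004")              # mixed Float and String input
```

Lambdas built this way could then be used as the values of the equality_procs hash, for example `float1: rounded_eq(7), float2: rounded_eq(2)`. Note that rounding both sides can still report a difference for two values that sit very close together on either side of a rounding boundary; comparing the absolute difference against a tolerance is an alternative if that matters.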

Hope the above example helps make clear how to achieve this.

Cheers

@agardiner
Oh my, this might be perfect for what I need. I'll try it and close this issue if it works for me. Thank you so much!

Edit: one more thing: will this work for procs and blocks as well?
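
For reference, in plain Ruby a lambda, a proc, and a block captured with `&` are all Proc objects that respond to `call`. The sketch below only shows that Ruby-level behaviour; whether csv-diff accepts each form depends on how it invokes the value, which this thread does not confirm. `capture` is a hypothetical helper used only to turn a block into a Proc:

```ruby
# Three ways to build the same two-argument comparison in plain Ruby.
as_lambda = lambda { |l, r| l.to_f.round(2) == r.to_f.round(2) }
as_proc   = proc   { |l, r| l.to_f.round(2) == r.to_f.round(2) }

# Hypothetical helper: captures a block and returns it as a Proc object.
def capture(&block)
  block
end
from_block = capture { |l, r| l.to_f.round(2) == r.to_f.round(2) }

callables = [as_lambda, as_proc, from_block]

# All three are Proc instances and can be invoked with call(left, right).
all_procs = callables.all? { |c| c.is_a?(Proc) }
results   = callables.map { |c| c.call("1.001", "1.004") }
```

The main practical difference is that a lambda checks its argument count strictly, while a plain proc silently ignores extra arguments or fills missing ones with nil.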

@agardiner I think this will solve my problem. I had to do some transformation to use serialized procs, since I am passing values from a YAML file, but this will work for me. Thanks!
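
One way to drive this from YAML without serializing procs at all is to store only the number of decimal places per field and build the lambdas at load time. This is a sketch under assumptions, not the approach actually used above: the `precision` key and the YAML layout are hypothetical, and the `to_f` conversion assumes values may arrive as Strings:

```ruby
require 'yaml'

# Hypothetical YAML layout: field name => number of decimal places.
config = YAML.safe_load(<<~YAML)
  precision:
    float1: 7
    float2: 2
YAML

# Build one comparison lambda per configured field, keyed by Symbol.
equality_procs = config.fetch('precision').each_with_object({}) do |(field, places), procs|
  procs[field.to_sym] = lambda do |left, right|
    left.to_f.round(places) == right.to_f.round(places)
  end
end
```

The resulting hash has the same shape as the equality_procs value in the example earlier in the thread, so it could be passed in the options hash in the same way.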

Edit: One more thing: this does not work if the fields are declared as Strings but the equality_procs keys are Symbols, since the two do not match each other. I think this could be raised as a separate issue later.
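
The mismatch comes from plain Ruby Hash behaviour: a Symbol key and a String key with the same text are different keys. A minimal sketch of the problem and of normalizing both sides to one type before building the options (the variable names here are illustrative only):

```ruby
dp_comp = lambda { |l, r| l.to_f.round(2) == r.to_f.round(2) }

# Field names declared as Strings, equality procs keyed by Symbols.
fields = ['key', 'text1', 'float1']
procs  = { float1: dp_comp }

# A String lookup does not find the Symbol key.
found_before = procs.key?('float1')

# Option 1: convert the proc keys to Strings to match the field names.
string_keyed = procs.transform_keys(&:to_s)
found_after  = string_keyed.key?('float1')

# Option 2: convert the field names to Symbols to match the proc keys.
symbol_fields = fields.map(&:to_sym)
```

Either direction works as long as the field names and the equality_procs keys end up as the same type.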