maxharlow/csvmatch

dash character and -a option

aborruso opened this issue · 9 comments

Hi,
I have these two input files

Name,Age
Andy,32
Mary-Jane,43


Name,City
Andy,Rome
Mary Jane,New York

If I run

csvmatch -i -a -n input_01.csv input_02.csv --fields1 "Name" --fields2 "Name"

I get

Name,Name
Andy,Andy

With -a, "Mary-Jane" should match "Mary Jane". Isn't the dash a non-alphanumeric character?

Thank you

If I create this rule.txt file

 $

and run

csvmatch -i -a -n -l rule.txt input_01.csv input_02.csv --fields1 "Name" --fields2 "Name"

I also get Mary Jane

Name,Name
Andy,Andy
Mary-Jane,Mary Jane

But this makes no sense to me, because the only rule I added is " $", a whitespace character at the end of the string.

Thank you

Ok, I'm not sure if this is a bug or just something that isn't clear in the documentation.

The reason it happens is that 'ignoring' nonalphanumeric characters means they are removed -- so Mary-Jane becomes MaryJane, which doesn't match. A workaround would be to use something like Levenshtein, which even with a high threshold like 85% should produce a match.
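A minimal Python sketch of the behaviour described above, not csvmatch's actual code. It assumes that -a drops every character that is neither a letter, a digit, nor whitespace, and that the fuzzy score is one minus the edit distance divided by the longer length. The function names are hypothetical.

```python
import re


def strip_nonalphanumeric(value: str) -> str:
    # Assumption: whitespace is kept; any other non-alphanumeric character is removed.
    return re.sub(r"[^A-Za-z0-9\s]", "", value)


def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, one row at a time.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    # Assumption: similarity is normalised by the longer of the two strings.
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest


left = strip_nonalphanumeric("Mary-Jane")   # "MaryJane"
right = strip_nonalphanumeric("Mary Jane")  # "Mary Jane"
print(left == right)                     # False: an exact comparison fails
print(similarity(left, right) >= 0.85)   # True: one edit across nine characters
```

Under these assumptions the two names differ by a single inserted space, so a threshold of 85% is still met.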

One option to resolve problems like this would be to replace nonalphanumerics with spaces instead of removing them. Of course with cases like Lastname, Firstname, you'd end up with two spaces, but with a flag to ignore repeating whitespace characters (such as your suggestion in #29) it could work quite well.
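The proposal above could look roughly like this in Python. It is only a sketch of the idea, not code from csvmatch, and both function names are hypothetical.

```python
import re


def nonalphanumeric_to_space(value: str) -> str:
    # Replace (rather than remove) anything that is not a letter, digit, or whitespace.
    return re.sub(r"[^A-Za-z0-9\s]", " ", value)


def collapse_whitespace(value: str) -> str:
    # Squash runs of whitespace into one space and trim the ends.
    return re.sub(r"\s+", " ", value).strip()


def normalise(value: str) -> str:
    return collapse_whitespace(nonalphanumeric_to_space(value))


print(normalise("Mary-Jane") == normalise("Mary Jane"))  # True
print(repr(nonalphanumeric_to_space("Lastname, Firstname")))  # 'Lastname  Firstname' (two spaces)
print(repr(normalise("Lastname, Firstname")))  # 'Lastname Firstname'
```

With both steps applied, "Mary-Jane" and "Mary Jane" normalise to the same string, and the double space left by the comma disappears.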

Hi @maxharlow thank you.

But why does it work with $ in the -l file? Why does "Mary Jane" match "MaryJane"?

Afraid I wasn't able to replicate that. Would you mind checking again?

@maxharlow look here http://youtu.be/cUfAunJnUuU?hd=1

My files are

# rule.txt
 $
# input_01.csv
Name,Age
Andy,32
Mary-Jane,43
Andrè,50
# input_02.csv
Name,City
Andy,Rome
Mary Jane,New York
Andre',Palermo

How odd. I've tried it with files exactly the same as yours. A single space in rule.txt would make sense, as that would remove the space from the second file, and the -a would remove the hyphen from the first. But with the $, it shouldn't match, and I don't know why it does for you.
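To illustrate the difference between the two rules, here is a small Python sketch. It assumes that each line of the -l file is treated as a regular expression whose matches are removed from the value, which is how this thread describes it; apply_rules is a hypothetical helper, not part of csvmatch.

```python
import re


def apply_rules(value: str, patterns: list[str]) -> str:
    # Assumption: every pattern is a regex and its matches are deleted.
    for pattern in patterns:
        value = re.sub(pattern, "", value)
    return value


# " $" only matches a space at the very end of the value, so nothing changes here.
print(repr(apply_rules("Mary Jane", [" $"])))   # 'Mary Jane'

# A trailing space is the one case the " $" rule affects.
print(repr(apply_rules("Mary Jane ", [" $"])))  # 'Mary Jane'

# A bare space matches every space, including the one in the middle.
print(repr(apply_rules("Mary Jane", [" "])))    # 'MaryJane'
```

Under that assumption only the single-space rule turns "Mary Jane" into "MaryJane", which is the value -a produces from "Mary-Jane".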

OK @maxharlow, I'm closing this. I got a good answer to my question.

Thank you

As of v1.19 this should now work as you originally expected

Wow, I'm very proud of it too :)