joshaven/string_score

Hello World and jello

Closed this issue · 5 comments

"Hello World" and "jello" should score higher than 0 with a fuzziness of 0.5, says your test.

To add onto that:

  • "Hello World" and "Hallo World" = 0.1984848484848485
  • "Hello World" and "Hazzzzzzzzz" = 0.1984848484848485

I guess that, the way the algorithm works, it just stops caring after the first missed character.
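
A toy sketch of that guess, assuming a scorer that gives up at the first mismatch. `toyScore` is a hypothetical name and this is not string_score's actual code; it only shows why two very different strings would get identical scores under that assumption.

```javascript
// Hypothetical toy scorer, NOT string_score's implementation.
// It compares characters position by position and stops at the first miss,
// so everything after the miss has no effect on the result.
function toyScore(target, query) {
  let matched = 0;
  for (let i = 0; i < query.length; i++) {
    if (target[i] !== query[i]) break; // first missed character: stop caring
    matched++;
  }
  return matched / target.length;
}

// Both queries share only the leading "H" with the target,
// so the toy scorer cannot tell them apart.
const a = toyScore("Hello World", "Hallo World");
const b = toyScore("Hello World", "Hazzzzzzzzz");
console.log(a === b);
```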

Other than speed, what is the benefit of using string_score over proven similarity algorithms such as the Jaro-Winkler distance, which I've used for a long time with great success?

http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance

I am not familiar with Jaro-Winkler so I cannot answer.

Thanks for the link. I'll study up and try to improve my string_score.

The fuzziness is something I just added, so the 'stop caring' behavior is not something I've worked on much. I hope to get some time soon to go back over the project and fix minor issues like this.

The Wikipedia article says that Jaro-Winkler is intended for short strings such as names. My string score works fine with longer strings (500 characters and more), so this may be one benefit. The issue with string length is actually why I wrote the string score.

issue resolved

Regarding: Jaro Winkler distance
I added a Jaro-Winkler comparison. I think looking at this method will help me improve mine a bit. However, Jaro-Winkler does less and is slower (in JavaScript), which may only be due to the way I implemented it. I may be able to squeeze a few more milliseconds out. The speed difference is very minor compared to the other options I have looked at. The Jaro (dj) method is great, but the Winkler (dw) adjustment is only a beginning-of-string bonus, which is not really enough in my estimation. I give bonuses for beginning of string, beginning of word, consecutive characters, and proper case.
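
For readers unfamiliar with dj and dw, here is a minimal sketch of the textbook Jaro and Jaro-Winkler definitions. It is not the comparison added to this project; the function names `jaro` and `jaroWinkler` and the default scaling factor `p = 0.1` are assumptions taken from the usual description of the method.

```javascript
// Jaro similarity (dj): counts characters that match within a limited range
// of positions, then penalises matched characters that appear out of order.
function jaro(s1, s2) {
  if (s1 === s2) return 1;
  if (s1.length === 0 || s2.length === 0) return 0;

  // Two characters only count as a match if they are at most `range` apart.
  const range = Math.max(0, Math.floor(Math.max(s1.length, s2.length) / 2) - 1);
  const matched1 = new Array(s1.length).fill(false);
  const matched2 = new Array(s2.length).fill(false);

  let matches = 0;
  for (let i = 0; i < s1.length; i++) {
    const start = Math.max(0, i - range);
    const end = Math.min(s2.length - 1, i + range);
    for (let j = start; j <= end; j++) {
      if (!matched2[j] && s1[i] === s2[j]) {
        matched1[i] = true;
        matched2[j] = true;
        matches++;
        break;
      }
    }
  }
  if (matches === 0) return 0;

  // Walk both strings' matched characters in order; each pair that differs
  // is half of a transposition.
  let outOfOrder = 0;
  let k = 0;
  for (let i = 0; i < s1.length; i++) {
    if (!matched1[i]) continue;
    while (!matched2[k]) k++;
    if (s1[i] !== s2[k]) outOfOrder++;
    k++;
  }
  const t = outOfOrder / 2;

  return (matches / s1.length + matches / s2.length + (matches - t) / matches) / 3;
}

// Winkler adjustment (dw): boosts dj for a shared prefix of up to 4 characters.
// This prefix boost is the only bonus the adjustment adds.
function jaroWinkler(s1, s2, p = 0.1) {
  const dj = jaro(s1, s2);
  const limit = Math.min(4, s1.length, s2.length);
  let prefix = 0;
  while (prefix < limit && s1[prefix] === s2[prefix]) prefix++;
  return dj + prefix * p * (1 - dj);
}

console.log(jaroWinkler("Hello World", "Hallo World"));
console.log(jaroWinkler("Hello World", "Hazzzzzzzzz"));
```

Under this sketch, "Hallo World" scores well above "Hazzzzzzzzz" against "Hello World", because Jaro keeps matching characters after a miss. The Winkler step contributes the same single-character prefix bonus to both.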