luozhouyang/python-string-similarity

speed of JaroWinkler

Closed this issue · 4 comments

strsimpy's JaroWinkler is slower than jellyfish's implementation, and the two return different results.

%%timeit
a = 'book egwrhgr rherh'
b = 'fvdaabavvvvvadvdvavavadfsfsdafvvav book teee'

import jellyfish
jellyfish.jaro_winkler(a,b) 
# 3.97 µs ± 169 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# result > 0.35942760942760943

%%timeit
a = 'book egwrhgr rherh'
b = 'fvdaabavvvvvadvdvavavadfsfsdafvvav book teee'
from strsimpy.jaro_winkler import JaroWinkler
jarowinkler = JaroWinkler()
jarowinkler.distance(a,b)
# 69.8 µs ± 706 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# result > 0.6405723905723906
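Both cells above time everything in the cell body, including the string assignments, the import statement, and (for strsimpy) the JaroWinkler() construction. A minimal sketch that times only the comparison call is shown below. The helper name time_call is hypothetical, and the sketch assumes jellyfish still exposes jaro_winkler as used in this thread; each library is skipped if it is missing.

```python
import timeit


def time_call(fn, *args, number=1000, repeat=5):
    """Return the best observed per-call time in seconds for fn(*args)."""
    timer = timeit.Timer(lambda: fn(*args))
    return min(timer.repeat(repeat=repeat, number=number)) / number


a = 'book egwrhgr rherh'
b = 'fvdaabavvvvvadvdvavavadfsfsdafvvav book teee'

# Setup (imports, object construction) happens outside the timed call.
try:
    from strsimpy.jaro_winkler import JaroWinkler
    jw = JaroWinkler()
    print("strsimpy similarity:", time_call(jw.similarity, a, b))
except ImportError:
    print("strsimpy is not installed")

try:
    import jellyfish
    # Assumption: the function name matches the one used in this thread.
    print("jellyfish jaro_winkler:", time_call(jellyfish.jaro_winkler, a, b))
except (ImportError, AttributeError):
    print("jellyfish.jaro_winkler is not available")
```

Taking the minimum over the repeats reduces noise from other processes; it does not change which implementation is faster.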

I am not sure why the results are different, but strsimpy gets the same results as java-string-similarity/JaroWinklerTest.java:

>>> jarowinkler.similarity('My string', 'My tsring')
0.9740740740740741
>>> jarowinkler.similarity('My string', 'My ntrisg')
0.8962962962962963

jellyfish has both a Python and a C implementation of JaroWinkler. Which implementation did you use for the comparison?

I used the Python implementation.

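I know why the two results are different. jellyfish.jaro_winkler(a, b) calculates the similarity between a and b, but jarowinkler.distance(a, b) calculates the distance between a and b. The two numbers reported above sum to 1 (0.35942760942760943 + 0.6405723905723906), which is consistent with distance being 1 - similarity. If you use jarowinkler.similarity(a, b), you get the same result as jellyfish.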
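The relationship can be checked with the numbers already posted in this thread. The sketch below assumes distance is defined as 1 - similarity, which the two reported values suggest; the helper name distance_to_similarity is hypothetical and is not part of either library.

```python
import math

# Values reported earlier in this thread.
jellyfish_similarity = 0.35942760942760943  # jellyfish.jaro_winkler(a, b)
strsimpy_distance = 0.6405723905723906      # JaroWinkler().distance(a, b)


def distance_to_similarity(distance):
    """Convert a normalized distance in [0, 1] to a similarity."""
    return 1.0 - distance


# Compare with a tolerance because of floating-point rounding.
same = math.isclose(distance_to_similarity(strsimpy_distance),
                    jellyfish_similarity)
print(same)
```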

jellyfish uses the C implementation of JaroWinkler by default. Here is the code in jellyfish/__init__.py:

import warnings

try:
    from .cjellyfish import *  # noqa

    library = "C"
except ImportError:
    from ._jellyfish import *  # noqa

    library = "Python"
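Given the snippet above, the module-level library variable should tell you which implementation was actually loaded. The sketch below assumes a jellyfish version whose __init__.py matches the quoted code; other versions may not define library, so the helper (a hypothetical name, jellyfish_backend) falls back to "unknown".

```python
import importlib


def jellyfish_backend():
    """Return the backend name jellyfish reports, or a fallback string."""
    try:
        module = importlib.import_module("jellyfish")
    except ImportError:
        return "not installed"
    # Assumption: `library` exists only in versions matching the quoted code.
    return getattr(module, "library", "unknown")


print(jellyfish_backend())
```

If this prints "C", a benchmark against jellyfish.jaro_winkler compares strsimpy's pure-Python code with compiled code, which would account for much of the gap reported above.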