speed of JaroWinkler
Closed this issue · 4 comments
reza1615 commented
JaroWinkler is slower than jellyfish's implementation. Also, the results are different.
%%timeit
a = 'book egwrhgr rherh'
b = 'fvdaabavvvvvadvdvavavadfsfsdafvvav book teee'
import jellyfish
jellyfish.jaro_winkler(a,b)
# 3.97 µs ± 169 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# result > 0.35942760942760943
%%timeit
a = 'book egwrhgr rherh'
b = 'fvdaabavvvvvadvdvavavadfsfsdafvvav book teee'
from strsimpy.jaro_winkler import JaroWinkler
jarowinkler = JaroWinkler()
jarowinkler.distance(a,b)
# 69.8 µs ± 706 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# result > 0.6405723905723906
luozhouyang commented
I am not sure why the results are different, but strsimpy gets the same results as java-string-similarity/JaroWinklerTest.java:
>>> jarowinkler.similarity('My string', 'My tsring')
0.9740740740740741
>>> jarowinkler.similarity('My string', 'My ntrisg')
0.8962962962962963
jellyfish has both a Python and a C implementation of JaroWinkler. Which implementation did you use for the comparison?
reza1615 commented
I used the Python implementation.
luozhouyang commented
I know why the two results are different. jellyfish.jaro_winkler(a, b) calculates the similarity between a and b, but jarowinkler.distance(a,b) calculates the distance between a and b, which is 1 - similarity. That is why the two numbers above add up to 1. If you use jarowinkler.similarity(a,b), you get the same result as jellyfish.
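The relationship can be checked with the two numbers reported earlier in this thread. This is only a minimal sketch, assuming distance is defined as 1 - similarity. The helper name similarity_to_distance is hypothetical and is not part of either library:

```python
import math


def similarity_to_distance(similarity: float) -> float:
    """Hypothetical helper: convert a similarity in [0, 1] to a distance."""
    return 1.0 - similarity


# Values copied from the timing cells earlier in this thread.
jellyfish_similarity = 0.35942760942760943
strsimpy_distance = 0.6405723905723906

converted = similarity_to_distance(jellyfish_similarity)

# Compare with a tolerance, since these are floating-point values.
matches = math.isclose(converted, strsimpy_distance, rel_tol=1e-12)
print(matches)
```

If the check passes, the two libraries agree on this input and only the reported quantity differs.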
luozhouyang commented
jellyfish uses the C implementation of JaroWinkler by default. Here is the code in jellyfish/__init__.py:
import warnings
try:
    from .cjellyfish import *  # noqa
    library = "C"
except ImportError:
    from ._jellyfish import *  # noqa
    library = "Python"
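A quick way to see which backend was actually loaded, as a rough sketch. It assumes the installed jellyfish version still exposes the module-level library variable shown above, which may not hold for other releases. The helper name jellyfish_backend is hypothetical, and it returns a placeholder string when jellyfish or the attribute is missing:

```python
import importlib


def jellyfish_backend() -> str:
    """Hypothetical helper: report which jellyfish backend was imported."""
    try:
        module = importlib.import_module("jellyfish")
    except ImportError:
        # jellyfish is not available in this environment.
        return "not installed"
    # Assumption: the module defines `library` as in the __init__.py above.
    return str(getattr(module, "library", "unknown"))


print(jellyfish_backend())
```

If this prints "C", the timing above compares a C extension against pure-Python code in strsimpy, which would account for much of the speed gap.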