stringdist with cosine distance returns Inf on duplicated letters
Closed this issue · 3 comments
I was wondering why stringdist returns Inf for the combination of dfg and dfgdfg:
stringdistmatrix("dfg", "dfgdfg", method = "cosine")
Is this expected behaviour? If yes, why does it happen?
R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.19.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] stringdist_0.9.4.6

loaded via a namespace (and not attached):
[1] compiler_3.4.3 parallel_3.4.3 tools_3.4.3 yaml_2.1.16
Confirmed. Looks like a bug.
This always seems to happen when a string is exactly duplicated. Is there a chance you can fix this in the near future (I would do it myself, but I am not proficient in C)?
Fixed it. It was due to machine rounding producing a -0 that got converted to Inf.
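
For readers wondering how rounding can matter here, below is a minimal base-R sketch. It is not the package's C code. It assumes single-character (q = 1) count profiles and the textbook cosine distance formula. The helper names cosine_raw and cosine_clamped are hypothetical. For dfg and dfgdfg the two profiles are exactly proportional, so the distance should be 0, but the floating-point result can land a hair off zero, and a result on the wrong side of zero could be mishandled by code that gives negative or signed-zero values a special meaning. Clamping is one possible guard, not necessarily the fix that was applied.

```r
# Character (q = 1) count profiles over the alphabet d, f, g.
x <- c(d = 1, f = 1, g = 1)   # "dfg"
y <- c(d = 2, f = 2, g = 2)   # "dfgdfg"

# Textbook cosine distance: 1 - <a, b> / (||a|| * ||b||).
# Assumption: this is only an illustration of the formula,
# not how stringdist computes it internally.
cosine_raw <- function(a, b) {
  1 - sum(a * b) / (sqrt(sum(a * a)) * sqrt(sum(b * b)))
}

# Hypothetical guard: clamp rounding noise into the valid range [0, 1]
# (count profiles are non-negative, so the true distance lies in [0, 1]).
cosine_clamped <- function(a, b) {
  min(1, max(0, cosine_raw(a, b)))
}

raw <- cosine_raw(x, y)
print(raw, digits = 17)        # a value at or very near zero; its sign depends on rounding
print(cosine_clamped(x, y))    # never negative
```

Comparing the raw and clamped values on a given machine shows whether the unguarded formula drifts below zero for this pair.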