markvanderloo/stringdist

stringdist with cosine distance returns Inf on duplicated letters

Closed this issue · 3 comments

I was wondering why stringdist returns Inf for the cosine distance between "dfg" and "dfgdfg":

stringdistmatrix("dfg", "dfgdfg", method = "cosine")

Is this expected behaviour? If yes, why does it happen?

R version 3.4.3 (2017-11-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.19.so

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=C LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] stringdist_0.9.4.6

loaded via a namespace (and not attached):
[1] compiler_3.4.3 parallel_3.4.3 tools_3.4.3 yaml_2.1.16

Confirmed. Looks like a bug.

This always seems to happen when a string is exactly duplicated. Is there a chance you can fix this in the near future (I would do it myself, but I am not proficient in C)?

Fixed it. The cause was machine rounding: it produced a -0 that got converted to Inf.