Value of Jaccard distance different from the one calculated with TextDistance
I calculated the Jaccard distance between two strings using the TextDistance Python package (version 4.5.0) in the following way:
import textdistance
textdistance.jaccard.distance('jharrisexamplecom','jsmithexamplecom')
and I get 0.26315789473684215 as the result.
I calculated the same distance between the same strings using the R package stringdist (version 3.4.1) in the following way:
library(stringdist)
stringdist('jharrisexamplecom','jsmithexamplecom', method = 'jaccard')
and I get 0.1428571 this time. I checked that q = 1 is the default in both cases.
Why do I see this difference in the calculation?
I'm not completely sure what Python does, but a quick test shows that R does what I would expect:
> str1 <- "jharrisexamplecom"
> str2 <- "jsmithexamplecom"
> str2<- strsplit(str2, "")[[1]]
> str1<- strsplit(str1, "")[[1]]
> letters <- unique(c(str1, str2))
> letters
[1] "j" "h" "a" "r" "i" "s" "e" "x" "m" "p" "l" "c" "o" "t"
> letters1 <- letters %in% str1
> letters2 <- letters %in% str2
> letters1
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[13] TRUE FALSE
> letters2
[1] TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[13] TRUE TRUE
> letters2 & letters1
[1] TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[13] TRUE FALSE
> sum(letters2 & letters1)
[1] 12
> 1-sum(letters2 & letters1)/length(letters)
[1] 0.1428571
I noticed that TextDistance also provides the class textdistance.Jaccard(), which accepts the as_set parameter. Its default value is False, so the union and intersection operations don't take the uniqueness of elements into account: repeated characters are counted with their multiplicity. If you set it to True, the values are equal to the stringdist ones.
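To make the difference concrete, here is a minimal sketch in plain Python that reproduces both numbers from this thread without either library. The function name jaccard_distance is hypothetical, and its as_set flag only imitates the TextDistance parameter described above; this is not the actual implementation of either package. Returning 0.0 for two empty strings is also just an assumption made for this sketch.

```python
from collections import Counter


def jaccard_distance(a: str, b: str, as_set: bool = False) -> float:
    """Jaccard distance over single characters (q = 1).

    Hypothetical helper for illustration only; not the TextDistance or
    stringdist implementation.
    """
    if as_set:
        # Set semantics: every distinct character counts once.
        chars_a, chars_b = set(a), set(b)
        intersection = len(chars_a & chars_b)
        union = len(chars_a | chars_b)
    else:
        # Multiset semantics: repeated characters keep their counts.
        # Counter & takes the minimum count, Counter | the maximum count.
        counts_a, counts_b = Counter(a), Counter(b)
        intersection = sum((counts_a & counts_b).values())
        union = sum((counts_a | counts_b).values())
    if union == 0:
        # Assumption for this sketch: two empty strings are identical.
        return 0.0
    return 1 - intersection / union


s1, s2 = "jharrisexamplecom", "jsmithexamplecom"
print(jaccard_distance(s1, s2))               # multiset: 1 - 14/19
print(jaccard_distance(s1, s2, as_set=True))  # set: 1 - 12/14
```

With counts preserved, the intersection has 14 characters and the union 19, giving 1 - 14/19 ≈ 0.2632, the value reported for TextDistance. With duplicates removed, 12 of the 14 distinct letters are shared, giving 1 - 12/14 ≈ 0.1429, the value reported for stringdist.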