markvanderloo/stringdist

Value of Jaccard distance different from the one calculated with TextDistance


I calculated the Jaccard distance between two strings using the TextDistance Python package (version 4.5.0) in the following way:

import textdistance
textdistance.jaccard.distance('jharrisexamplecom','jsmithexamplecom')

and I get 0.26315789473684215 as the result.

I calculated the same distance between the same strings using stringdist (version 3.4.1) in the following way:

library(stringdist)
stringdist('jharrisexamplecom','jsmithexamplecom', method = 'jaccard')

and this time I get 0.1428571. I checked that both packages use q = 1 by default.

Why do I see this difference in the calculation?

I'm not completely sure what Python does, but a quick test shows that R does what I would expect:

> str1 <- "jharrisexamplecom"
> str2 <- "jsmithexamplecom"
> str2<- strsplit(str2, "")[[1]]
> str1<- strsplit(str1, "")[[1]]
> letters <- unique(c(str1, str2))
> letters
 [1] "j" "h" "a" "r" "i" "s" "e" "x" "m" "p" "l" "c" "o" "t"
> letters1 <- letters %in% str1
> letters2 <- letters %in% str2
> letters1
 [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[13]  TRUE FALSE
> letters2
 [1]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[13]  TRUE  TRUE
> letters2 & letters1
 [1]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
[13]  TRUE FALSE
> sum(letters2 & letters1)
[1] 12
> 1-sum(letters2 & letters1)/length(letters)
[1] 0.1428571

I noticed that TextDistance also provides the class textdistance.Jaccard(), which accepts an as_set parameter. Its default value is False, so the union and intersection are computed on multisets: a repeated character is counted as many times as it occurs instead of once. If you set as_set to True, the values are equal to the stringdist ones.
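
For anyone who wants to check the two numbers by hand, here is a minimal sketch that uses only the Python standard library. The function jaccard_distance is a hypothetical helper written for this illustration, not TextDistance's or stringdist's actual implementation; its as_set argument only mirrors the name of the TextDistance parameter, and it assumes q = 1 (single characters).

```python
from collections import Counter


def jaccard_distance(a: str, b: str, as_set: bool) -> float:
    """Jaccard distance over single characters (q = 1).

    Hypothetical helper for illustration only.
    """
    if as_set:
        # Set semantics: every distinct character counts once.
        sa, sb = set(a), set(b)
        intersection = len(sa & sb)
        union = len(sa | sb)
    else:
        # Multiset semantics: repeated characters keep their counts.
        # Counter & Counter takes the minimum count, | takes the maximum.
        ca, cb = Counter(a), Counter(b)
        intersection = sum((ca & cb).values())
        union = sum((ca | cb).values())
    if union == 0:
        # Both strings empty: treat them as identical.
        return 0.0
    return 1 - intersection / union


s1 = "jharrisexamplecom"
s2 = "jsmithexamplecom"

# 12 shared letters out of 14 distinct letters: 1 - 12/14
set_distance = jaccard_distance(s1, s2, as_set=True)
# 14 shared occurrences out of 19 in the multiset union: 1 - 14/19
bag_distance = jaccard_distance(s1, s2, as_set=False)

print(round(set_distance, 7), round(bag_distance, 7))
```

The set-based value matches the 0.1428571 reported by stringdist, and the multiset-based value matches the 0.2631579 obtained from textdistance.jaccard with its default settings.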