markvanderloo/stringdist

stringdist/qgram behaviour when q > nchar(x)

markvanderloo opened this issue · 0 comments

I understand that the q-gram distance is the sum of absolute differences between q-gram vectors of both strings. But I see some weird behavior when one of the strings is shorter than the chosen q.

For the two strings below, the qgrams function gives the correct counts:

> qgrams("a", "the cat sat on the mat", q = 2)
   th he t  sa on n  ma e   c ca at  s  t  o  m
V1  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
V2  2  2  2  1  1  1  1  2  1  1  3  1  1  1  1

The stringdist function returns:

> stringdist("a", "the cat sat on the mat", q = 2, method = "qgram")
[1] Inf

Instead of returning:

> sum(qgrams("a", "the cat sat on the mat", q = 2)[2,])
[1] 21
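
The expected value can be checked without the package. The following is a minimal base-R sketch of the definition given above (sum of absolute differences between q-gram counts). It assumes that a string shorter than q simply has no q-grams. The helpers `qgram_list` and `qgram_dist` are hypothetical names used for illustration; they are not part of stringdist and do not reflect how the package computes the distance internally.

```r
# Hypothetical helper: all overlapping q-grams of a single string.
qgram_list <- function(s, q) {
  n <- nchar(s)
  if (n < q) return(character(0))  # assumption: a too-short string has no q-grams
  starts <- seq_len(n - q + 1)
  substring(s, starts, starts + q - 1)
}

# Hypothetical helper: q-gram distance as the sum of absolute count differences.
qgram_dist <- function(a, b, q) {
  ga <- qgram_list(a, q)
  gb <- qgram_list(b, q)
  grams <- union(ga, gb)
  # Count every q-gram in both strings; a q-gram absent from a string counts as 0.
  ca <- vapply(grams, function(g) sum(ga == g), integer(1))
  cb <- vapply(grams, function(g) sum(gb == g), integer(1))
  sum(abs(ca - cb))
}

qgram_dist("a", "the cat sat on the mat", q = 2)
```

Under that assumption the call returns 21, the same as the sum over the second row of the qgrams output.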

Originally posted on Stack Overflow by Giora Simchoni.