markvanderloo/stringdist

Apparent inconsistency in output when both the number of characters of a & b are smaller than q


Consider the following call, which correctly returns NA:

stringdist("", "XXX", method = "cos", q = 3)

However, if both a and b have nchar() < q, the output becomes 0:

stringdist("", "XX", method = "cos", q = 3)

In my view, the output for the second case would be more consistent if it were also NA. Does that make sense?

Thanks, this relates somewhat to #48.

In the formal definition of the q-gram distance [1], we compare two q-gram vectors, whose length equals the number of all q-grams that can be formed from a chosen alphabet (in our case, the UTF code table). In the first case we therefore have to compare $(0,0,\ldots, 0)$ with $(0,0,\ldots,1,0,0,\ldots,0)$. The cosine distance between these two vectors is

$$ 1 - \frac{\langle (0,0,\ldots, 0),(0,0,\ldots,1,0,0,\ldots,0)\rangle}{|(0,0,\ldots, 0)| |(0,0,\ldots,1,0,0,\ldots,0)| } =1 - \frac{0}{0\cdot 1} = \textrm{undefined} $$

In the second case, we get two zero-vectors, as none of the possible 3-grams occurs in either input string.
So we have a choice: do we state that two zero-vectors are equal (in magnitude and direction) and say the distance is zero? Or do we say 'undefined', which is what we get when we fill in the equation?

So the main point is: in the first case we have no choice but to fill in the equation. In the second case we can detect that we have two equal q-gram vectors and use that. I admit that this is subtle.
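To make the two cases concrete, here is a minimal base-R sketch of the reasoning above. It is not the package's actual implementation; `qgram_profile` and `cosine_qgram` are hypothetical helper names, and only the q-grams that actually occur are stored instead of the full vector over the alphabet.

```r
# Hypothetical helper: count the q-grams of a single string.
# Returns an empty vector when the string is shorter than q.
qgram_profile <- function(s, q) {
  n <- nchar(s)
  if (n < q) return(integer(0))            # all-zero profile
  starts <- seq_len(n - q + 1)
  grams <- substring(s, starts, starts + q - 1)
  tab <- table(grams)
  setNames(as.integer(tab), names(tab))
}

# Hypothetical helper: cosine distance between two q-gram profiles.
cosine_qgram <- function(a, b, q) {
  pa <- qgram_profile(a, q)
  pb <- qgram_profile(b, q)
  # Case 2: both profiles are zero-vectors, so treat them as equal.
  if (length(pa) == 0 && length(pb) == 0) return(0)
  # Align both profiles on the q-grams seen in either string.
  keys <- union(names(pa), names(pb))
  va <- unname(pa[keys]); va[is.na(va)] <- 0
  vb <- unname(pb[keys]); vb[is.na(vb)] <- 0
  denom <- sqrt(sum(va^2)) * sqrt(sum(vb^2))
  # Case 1: exactly one zero-vector gives 0/0, which is undefined.
  if (denom == 0) return(NA_real_)
  1 - sum(va * vb) / denom
}

cosine_qgram("", "XXX", q = 3)  # undefined, reported as NA
cosine_qgram("", "XX", q = 3)   # two zero-vectors, reported as 0
```

The early return for two empty profiles is the "detect equal q-gram vectors" choice; removing it would make the second call return NA as well.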

Finally, the choice seems consistent with the following (method='qgram' measures the sum of absolute differences between q-gram profiles):

> stringdist(   "", "XX", method='qgram', q=3)
[1] 0
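For comparison, the sum of absolute differences between q-gram profiles can be sketched in base R as follows. Again, this is an illustration under the definition given above, not the package's code, and `qgram_dist` is a hypothetical name.

```r
# Hypothetical helper: sum of absolute differences between q-gram counts.
qgram_dist <- function(a, b, q) {
  # Extract all q-grams of a string; none if the string is shorter than q.
  grams <- function(s) {
    n <- nchar(s)
    if (n < q) return(character(0))
    starts <- seq_len(n - q + 1)
    substring(s, starts, starts + q - 1)
  }
  ga <- grams(a)
  gb <- grams(b)
  keys <- union(ga, gb)
  # Count each q-gram in both strings over the shared key set.
  ca <- vapply(keys, function(k) sum(ga == k), numeric(1))
  cb <- vapply(keys, function(k) sum(gb == k), numeric(1))
  sum(abs(ca - cb))
}

qgram_dist("", "XX", q = 3)  # no q-grams on either side, so the sum is 0
```

With this distance there is no division, so two empty profiles give 0 without any special case.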

[1] Ukkonen, E. (1992). Approximate string-matching with q-grams and maximal matches. Theoretical Computer Science 92(1), 191-211.