pln-fing-udelar/fast-krippendorff

form of random coincidence matrix

Garrafao opened this issue · 3 comments

I wonder about the form of the random coincidence matrix:

return np.divide(np.outer(n_v, n_v) - np.diagflat(n_v), n_v.sum() - 1, dtype=dtype)

Is there a source where this matrix is mathematically derived?

Also: what is the logic behind subtracting np.diagflat(n_v)? It only seems to affect the diagonal, which later is ignored by the distance functions having a diagonal of 0's. Doing

return np.divide(np.outer(n_v, n_v), n_v.sum() - 1, dtype=dtype)

yielded the same results in all cases I tested.

Sorry for the delay. It can be derived from the original source, but I think the Wikipedia article is clearer:

[image: equation from the Wikipedia article]

n_v in the code is the same as n_v in the equation here. The counts are multiplied for all possible pairs of values. On the diagonal you have to subtract n_v, because there the number of pairs is n_v * (n_v - 1) = n_v * n_v - n_v.
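To make the diagonal term concrete, here is a minimal sketch that fills the matrix cell by cell from the piecewise rule described above and compares it with the vectorized expression quoted in this issue. The helper names and the value counts are made up for illustration, and the `dtype` argument is left out for brevity.

```python
import numpy as np

def random_coincidences_vectorized(n_v):
    # Expression quoted in this issue (without the dtype argument).
    return np.divide(np.outer(n_v, n_v) - np.diagflat(n_v), n_v.sum() - 1)

def random_coincidences_pairwise(n_v):
    # Hypothetical helper: fills each cell from the piecewise rule.
    n = n_v.sum()
    size = len(n_v)
    e = np.empty((size, size), dtype=float)
    for v in range(size):
        for w in range(size):
            if v == w:
                # Diagonal: n_v * (n_v - 1) = n_v * n_v - n_v
                pairs = n_v[v] * (n_v[v] - 1)
            else:
                # Off the diagonal: plain product of the two counts
                pairs = n_v[v] * n_v[w]
            e[v, w] = pairs / (n - 1)
    return e

n_v = np.array([3, 5, 2])  # made-up value counts, for illustration only
vectorized = random_coincidences_vectorized(n_v)
pairwise = random_coincidences_pairwise(n_v)
same = np.allclose(vectorized, pairwise)
```

With the diagonal subtracted, each row of the matrix sums to the corresponding n_v, which is one way to sanity-check the result.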

> Also: what is the logic behind subtracting np.diagflat(n_v)? It only seems to affect the diagonal, which later is ignored by the distance functions having a diagonal of 0's. Doing

I understand that it's unnecessary in virtually all cases, but it certainly doesn't hurt the computation much.

I guess it's good to be consistent with the definition (that's what I originally wanted; this concern crossed my mind too). And if somebody wants to provide a distance function that violates d(v, v) = 0, for whatever reason, it will still work for them.
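A small sketch of both points, under some assumptions: the value counts are made up, and `weighted_sum` is a hypothetical stand-in for how a distance matrix weights the random coincidences, not the library's actual code. With a zero diagonal in the distance matrix the two variants agree; with a non-zero diagonal they do not.

```python
import numpy as np

n_v = np.array([3, 5, 2])  # made-up value counts, for illustration only

# The two variants discussed in this issue (dtype argument omitted).
with_diag_fix = np.divide(np.outer(n_v, n_v) - np.diagflat(n_v), n_v.sum() - 1)
without_diag_fix = np.divide(np.outer(n_v, n_v), n_v.sum() - 1)

# Nominal-style distance: 0 on the diagonal, 1 everywhere else.
d_zero_diag = 1.0 - np.eye(len(n_v))
# Hypothetical distance that violates d(v, v) = 0.
d_nonzero_diag = d_zero_diag + 0.5 * np.eye(len(n_v))

def weighted_sum(e, d):
    # Assumed stand-in: sum of coincidences weighted by the distance matrix.
    return float((e * d).sum())

# Zero diagonal: the diagonal entries are multiplied by 0, so both agree.
agree_zero_diag = np.isclose(
    weighted_sum(with_diag_fix, d_zero_diag),
    weighted_sum(without_diag_fix, d_zero_diag),
)

# Non-zero diagonal: the diagonal entries now contribute, so they differ.
agree_nonzero_diag = np.isclose(
    weighted_sum(with_diag_fix, d_nonzero_diag),
    weighted_sum(without_diag_fix, d_nonzero_diag),
)
```

Here `agree_zero_diag` comes out true and `agree_nonzero_diag` false, which matches the observation that the subtraction only matters for distance functions with a non-zero diagonal.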

I'm gonna go ahead and close this issue as I believe there's no action item. But we can continue the discussion and re-open it if needed.

Thanks. After seeing the definition on Wikipedia I agree that staying consistent with it makes sense.