form of random coincidence matrix
Garrafao opened this issue · 3 comments
I wonder about the form of the random coincidence matrix:
return np.divide(np.outer(n_v, n_v) - np.diagflat(n_v), n_v.sum() - 1, dtype=dtype)
Is there a source where this matrix is mathematically derived?
Also: what is the logic behind subtracting np.diagflat(n_v)? It only seems to affect the diagonal, which later is ignored by the distance functions having a diagonal of 0's. Doing
return np.divide(np.outer(n_v, n_v), n_v.sum() - 1, dtype=dtype)
yielded the same results in all cases I tested.
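To make the observation concrete, here is a minimal sketch comparing the two variants. The helper name `random_coincidences`, its `subtract_diagonal` flag, and the counts in `n_v` are hypothetical and only serve the illustration; the distance matrix is a nominal-style one with zeros on the diagonal.

```python
import numpy as np

def random_coincidences(n_v, subtract_diagonal=True, dtype=np.float64):
    """Hypothetical helper: the matrix from the issue, with or without the diagonal term."""
    pairs = np.outer(n_v, n_v)
    if subtract_diagonal:
        pairs = pairs - np.diagflat(n_v)
    return np.divide(pairs, n_v.sum() - 1, dtype=dtype)

n_v = np.array([3, 5, 2])  # made-up value counts

with_diag = random_coincidences(n_v)
without_diag = random_coincidences(n_v, subtract_diagonal=False)

# Distance matrix with d(v, v) = 0 on the diagonal and 1 elsewhere.
d = 1.0 - np.eye(len(n_v))

# The two matrices differ only on the diagonal...
diff = with_diag - without_diag
off_diagonal_equal = np.allclose(diff - np.diag(np.diag(diff)), 0)

# ...so weighting by a zero-diagonal distance gives the same total.
weighted_with = (with_diag * d).sum()
weighted_without = (without_diag * d).sum()
```

Under these assumptions the weighted sums agree, which matches the behaviour described above.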
Sorry for the delay. It can be derived from the original source, but I think the Wikipedia article is clearer:
`n_v` in the code is the same `n_v` as in the equation here. All possible pairs are multiplied, and when the diagonal is computed, you actually have to subtract `n_v`, because a value can only be paired with the other occurrences of itself: `n_v * (n_v - 1) = n_v * n_v - n_v`.
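One way to check this reading is to count the pairs by brute force. The sketch below assumes the expected coincidences are the ordered pairs of distinct items, divided by the total count minus one (an interpretation suggested by the `n_v * (n_v - 1)` diagonal, not something stated in the code). The counts in `n_v` are made up.

```python
import itertools
import numpy as np

n_v = np.array([3, 5, 2])  # made-up value counts
num_values = len(n_v)

# Expand the counts into one entry per item: [0, 0, 0, 1, 1, 1, 1, 1, 2, 2].
items = np.repeat(np.arange(num_values), n_v)

# Count every ordered pair of two distinct items by the values they carry.
counts = np.zeros((num_values, num_values))
for a, b in itertools.permutations(items, 2):
    counts[a, b] += 1
brute_force = counts / (n_v.sum() - 1)

# Closed form from the issue: off-diagonal n_c * n_k, diagonal n_c * (n_c - 1).
closed_form = (np.outer(n_v, n_v) - np.diagflat(n_v)) / (n_v.sum() - 1)
```

Because an item cannot be paired with itself, the diagonal cell for a value with count `n_c` receives `n_c * (n_c - 1)` pairs, which is where the subtracted `np.diagflat(n_v)` comes from.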
> Also: what is the logic behind subtracting np.diagflat(n_v)? It only seems to affect the diagonal, which later is ignored by the distance functions having a diagonal of 0's.
I understand that it's unnecessary in virtually all cases. For sure it doesn't hurt the computation much.
I guess it's good to be consistent with the definition (that's what I originally wanted; this concern crossed my mind). And if somebody wants to provide a distance function that violates `d(v, v) = 0`, for whatever reason, it's gonna work for them.
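As a rough illustration of that last point, the sketch below uses a made-up distance matrix with a non-zero diagonal. The counts in `n_v` and the matrix `d_nonzero` are hypothetical and do not correspond to any metric shipped with the library.

```python
import numpy as np

n_v = np.array([3, 5, 2])  # made-up value counts
denominator = n_v.sum() - 1

with_diag = (np.outer(n_v, n_v) - np.diagflat(n_v)) / denominator
without_diag = np.outer(n_v, n_v) / denominator

# Hypothetical distance that violates d(v, v) = 0: 0.5 on the diagonal, 1 elsewhere.
d_nonzero = np.full((len(n_v), len(n_v)), 1.0)
np.fill_diagonal(d_nonzero, 0.5)

# With a non-zero diagonal the two variants no longer agree.
expected_with = (with_diag * d_nonzero).sum()
expected_without = (without_diag * d_nonzero).sum()
```

Here the diagonal term does change the weighted sum, so only the variant with the subtraction stays consistent with the definition.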
I'm gonna go ahead and close this issue as I believe there's no action item. But we can continue the discussion, and in any case, re-open it.
Thanks. After seeing the definition on Wikipedia I agree that staying consistent with it makes sense.