pln-fing-udelar/fast-krippendorff

Krippendorff's alpha returns 0 when there's only one disagreement

imranraad07 opened this issue · 4 comments

For a dataset like this:

dataset = {
  "a": [3, 3,    3,    3,    3],
  "b": [3, 3,    3,    3,    3],
  "c": [3, 3,    None, None, 3],
  "d": [3, 3,    3,    3,    1],
  "e": [3, None, 3,    3,    3],
}

Krippendorff's alpha returns 0 for this dataset with both the nominal and the ordinal metric.

I don't know if this is an issue with Krippendorff's alpha itself or a problem with the implementation.
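For reference, this is roughly how I'm computing it (a minimal sketch; each key is a coder, each position a unit, and missing entries are passed as np.nan, though None gives me the same result):

import numpy as np
import krippendorff

dataset = {
    "a": [3, 3,      3,      3,      3],
    "b": [3, 3,      3,      3,      3],
    "c": [3, 3,      np.nan, np.nan, 3],
    "d": [3, 3,      3,      3,      1],
    "e": [3, np.nan, 3,      3,      3],
}

# Rows are coders, columns are units.
reliability_data = np.array(list(dataset.values()), dtype=float)

print(krippendorff.alpha(reliability_data=reliability_data, level_of_measurement="nominal"))
print(krippendorff.alpha(reliability_data=reliability_data, level_of_measurement="ordinal"))

Both calls print 0.0.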

Hey, I recommend computing the expected outcome of this example manually with Krippendorff's alpha formula for the nominal metric. For ordinal, I also recommend passing the potential values as an argument, since it seems that "2" can also be a value.

I'm not sure whether the keys in your example represent coders or units. I guess each key is a coder, whose values are what they assigned to each unit? And do you want None to represent a missing assignment? If so, use np.nan instead.
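Something along these lines, for example (just a sketch, assuming each key is a coder and that your installed version exposes the value_domain argument):

import numpy as np
import krippendorff

# Each row is a coder, each column a unit, with np.nan for missing assignments.
reliability_data = np.array([
    [3, 3,      3,      3,      3],       # a
    [3, 3,      3,      3,      3],       # b
    [3, 3,      np.nan, np.nan, 3],       # c
    [3, 3,      3,      3,      1],       # d
    [3, np.nan, 3,      3,      3],       # e
])

# Pass the full set of possible values so the ordinal metric also knows about "2",
# even though it never occurs in the data.
alpha = krippendorff.alpha(reliability_data=reliability_data,
                           value_domain=[1, 2, 3],
                           level_of_measurement="ordinal")
print(alpha)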

I'm closing this issue but feel free to continue it if you think there's something that needs to be addressed.

Hi @bryant1410, "2" is not a value. If you replace the values in the dataset with "1" and "2" instead of "1" and "3", you get the same result. Using np.nan also returns the same. Consider this example:

Units u: 1 2 3 4 5 6
Coder A * 1 2 1 1 1
Coder B 1 1 1 * 1 *
Coder C 1 * * 1 1 1
Coder D * 1 * 1 * 1

This gets the same result. Basically, whenever there's only one disagreement, the result is 0.
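Concretely, this is the call I'm making for that table (a sketch; rows are coders, with np.nan for the missing "*" cells):

import numpy as np
import krippendorff

reliability_data = np.array([
    [np.nan, 1,      2,      1,      1,      1],       # coder A
    [1,      1,      1,      np.nan, 1,      np.nan],  # coder B
    [1,      np.nan, np.nan, 1,      1,      1],       # coder C
    [np.nan, 1,      np.nan, 1,      np.nan, 1],       # coder D
])

print(krippendorff.alpha(reliability_data=reliability_data, level_of_measurement="nominal"))  # -> 0.0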

I'm not particularly versed in measuring reliability in general, but here's my understanding of the issue. I believe it's a property of Krippendorff's alpha itself.

The measure compares the observed disagreement to an expected one. The crux is how the expected disagreement is computed, which is similar to the way it's done for other reliability measures: the number of annotations for each value is taken as fixed (in your example, "2" appears only once, while "1" appears 15 times). With the rare value occurring just once, the distribution of the pairable values looks exactly like the expected one, so the observed and expected disagreement coincide. You end up in a situation where a perfect annotation (alpha = 1) and a random one (alpha = 0) converge.

In general, for alpha, kappa, and pi, the best-behaved scenario for computing the expected disagreement is when every value occurs with equal frequency. They all have issues when some value occurs rarely, because the perfect and the random annotation start to converge, leaving less room in the [0, 1] interval to report the inter-annotator agreement.
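To make that concrete, here's the nominal-metric arithmetic for your 4-coder table, written out with the coincidence matrix (this follows the standard formulation of alpha; it's not copied from this library's internals):

# Coincidence matrix for the 4-coder example above (values "1" and "2").
# For each unit with m pairable values, every ordered pair of values within
# the unit contributes 1 / (m - 1) to the corresponding cell.
# Pairable units: {1,1}, {1,1,1}, {2,1}, {1,1,1}, {1,1,1}, {1,1,1}
o_11 = 2 + 3 + 0 + 3 + 3 + 3            # agreements on "1" -> 14
o_12 = o_21 = 1                          # the single (1, 2) disagreement, both directions
n_1 = o_11 + o_12                        # marginal count of "1" -> 15
n_2 = o_21                               # marginal count of "2" -> 1
n = n_1 + n_2                            # total number of pairable values -> 16

d_o = (o_12 + o_21) / n                  # observed disagreement: 2/16 = 0.125
d_e = 2 * n_1 * n_2 / (n * (n - 1))      # expected disagreement: 2*15*1/(16*15) = 0.125
print(d_o, d_e, 1 - d_o / d_e)           # -> 0.125 0.125 0.0

Since the single "2" can only ever pair with a "1", the observed disagreement is exactly what chance predicts, hence alpha = 0.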

I also suggest debugging the alpha function, or maybe just adding prints. Looking at the values o and e as you vary the input can be helpful, along with looking at the formula. You can start by cloning this repo and then modifying the file sample.py here:

"* * * * * 3 4 1 2 1 1 3 3 * 3", # coder A
"1 * 2 1 3 3 4 3 * * * * * * *", # coder B
"* * 2 1 3 4 4 * 2 1 1 3 3 * 4", # coder C

Then debug the alpha call:

print("Krippendorff's alpha for nominal metric: ", krippendorff.alpha(reliability_data=reliability_data,