Krippendorff's alpha returns 0 when there's only one disagreement
imranraad07 opened this issue · 4 comments
For a dataset like this:

```python
dataset = {
    "a": [3, 3, 3, 3, 3],
    "b": [3, 3, 3, 3, 3],
    "c": [3, 3, None, None, 3],
    "d": [3, 3, 3, 3, 1],
    "e": [3, None, 3, 3, 3],
}
```
Krippendorff's alpha returns 0 in both the nominal and the ordinal case.
I don't know whether this is a property of Krippendorff's alpha itself or a problem with this implementation.
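To reproduce this without depending on any particular package, here is a minimal, self-contained sketch of the standard nominal formula (pure Python; this is not this library's implementation). `nominal_alpha` is a hypothetical helper name:

```python
from fractions import Fraction

def nominal_alpha(rows):
    """rows: per-coder value lists; None marks a missing rating."""
    # Keep only units (columns) with at least two pairable ratings.
    units = [[v for v in col if v is not None] for col in zip(*rows)]
    units = [vals for vals in units if len(vals) >= 2]
    n = sum(len(vals) for vals in units)

    counts = {}          # overall frequency of each pairable value
    d_o = Fraction(0)    # observed disagreement
    for vals in units:
        for v in vals:
            counts[v] = counts.get(v, 0) + 1
        # Disagreeing ordered pairs within the unit, weighted by 1/(m_u - 1).
        disagreeing = sum(1 for a in vals for b in vals if a != b)
        d_o += Fraction(disagreeing, len(vals) - 1)
    d_o /= n

    # Expected disagreement from the overall value frequencies.
    d_e = Fraction(sum(counts[a] * counts[b]
                       for a in counts for b in counts if a != b),
                   n * (n - 1))
    return 1 - d_o / d_e

dataset = {
    "a": [3, 3, 3, 3, 3],
    "b": [3, 3, 3, 3, 3],
    "c": [3, 3, None, None, 3],
    "d": [3, 3, 3, 3, 1],
    "e": [3, None, 3, 3, 3],
}
print(nominal_alpha(dataset.values()))  # prints 0
```

Working it out by hand: the pairable values are twenty-one 3s and one 1, so the observed disagreement is 2/22 and the expected disagreement is 2·21·1/(22·21) = 2/22 as well, giving alpha = 0.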
Hey, I recommend computing the expected outcome of this example manually with Krippendorff's alpha formula for nominal data. For ordinal data, I also recommend passing the potential values as an argument, since it seems that "2" can also be a value.
I'm not sure whether the keys in your example represent coders or units. I guess each key is a coder, whose values are what they assigned to each unit? And do you use `None` to represent a missing assignment? If so, use `np.nan` instead.
I'm closing this issue but feel free to continue it if you think there's something that needs to be addressed.
Hi @bryant1410, "2" is not a special value here. You can replace the dataset's values with "1" and "2" instead of "1" and "3" and you get the same result. `np.nan` also returns the same. Consider this example:
| Units u: | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Coder A | * | 1 | 2 | 1 | 1 | 1 |
| Coder B | 1 | 1 | 1 | * | 1 | * |
| Coder C | 1 | * | * | 1 | 1 | 1 |
| Coder D | * | 1 | * | 1 | * | 1 |
This gets the same result. Basically, whenever there is only one disagreement, it returns 0.
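For this table the arithmetic can be worked by hand using the standard nominal formulas (a sketch, not this library's code):

```python
from fractions import Fraction

# Pairable values in the table above: n = 16 in total
# (15 ones, plus the single 2 that Coder A assigned in unit 3).
n, ones, twos = 16, 15, 1

# Observed disagreement: only unit 3 disagrees. Its two ratings (2 and 1)
# form 2 disagreeing ordered pairs, weighted by 1 / (m_u - 1) = 1.
d_o = Fraction(2, n)                           # = 1/8

# Expected disagreement from the overall value frequencies.
d_e = Fraction(2 * ones * twos, n * (n - 1))   # = 30/240 = 1/8

alpha = 1 - d_o / d_e
print(alpha)  # prints 0
```

The observed and expected disagreements coincide exactly, so alpha comes out as 0, matching what the library reports.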
I'm not deeply versed in measuring reliability in general, but here's my understanding of the issue. I believe it's a property of Krippendorff's alpha itself.
The measure compares an observed disagreement to an expected one. The key is how the expected one is computed, which is similar to the way it's done for other reliability measures: the number of annotations for each value is assumed to be fixed for the computation (in your example, "2" appears only once, while "1" appears 15 times). So when you compare the distribution of the pairable values to the expected one, they look the same, and you end up in a situation where the perfect annotation (alpha = 1) and a random one (alpha = 0) converge.
In general, for alpha, kappa, and pi, the best-behaved scenario for computing the expected annotation is when each value occurs with equal frequency. These measures run into trouble when some value occurs rarely, because the perfect and the random annotation start to converge, leaving less room in the [0, 1] interval to report inter-annotator agreement.
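This convergence can be checked in general with a back-of-the-envelope sketch of the standard nominal formulas (again, not this library's code): whenever exactly one pairable value disagrees with all the others, the observed and expected disagreements are equal no matter how many values there are.

```python
from fractions import Fraction

# Suppose exactly one pairable value (say a lone "2") disagrees with all
# n - 1 remaining values (all "1"). If its unit has m ratings, the odd
# value forms 2 * (m - 1) disagreeing ordered pairs there, weighted by
# 1 / (m - 1), so it contributes exactly 2 to the observed disagreement,
# independently of m.
for n in (10, 16, 22, 100):
    d_o = Fraction(2, n)
    # Expected disagreement with value frequencies 1 and n - 1:
    d_e = Fraction(2 * 1 * (n - 1), n * (n - 1))
    assert d_o == d_e   # hence alpha = 1 - d_o / d_e = 0 for every n
```

So a single lone disagreement yields alpha = 0 regardless of the dataset size, which matches both of the examples in this thread.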
I also suggest debugging the `alpha` function, or maybe just adding prints. Looking at the values `o` and `e` after varying the input can be helpful, as is looking at the formula. You can start by cloning this repo and then modifying the file sample.py here:
Lines 10 to 12 in d1527fe
Then debug the `alpha` call:
Line 19 in d1527fe