jsbaan/calibration-on-disagreement-data
Code accompanying the EMNLP 2022 paper "Stop Measuring Calibration When Humans Disagree". We show that popular calibration metrics such as ECE are problematic in settings where more than one answer is acceptable, and argue for several metrics that take the full human judgement distribution into account.
Jupyter Notebook
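
To make the description concrete, here is a minimal sketch of the kind of mismatch it refers to. It is not taken from this repository: the function names `expected_calibration_error` and `total_variation_distance` and the toy data are hypothetical. Total variation distance is used only as one simple example of a measure that compares against the full human judgement distribution, and is not claimed to be the exact metric the paper proposes.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: weighted mean of |accuracy - confidence| per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Map each confidence to an equal-width bin index in [0, n_bins - 1].
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # mask.mean() is the fraction of items in the bin
    return float(ece)


def total_variation_distance(p, q):
    """Mean total variation distance between rows of two distribution arrays."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(0.5 * np.abs(p - q).sum(axis=-1).mean())


# Toy data (assumed for illustration): 4 items, 3 classes, and annotators
# split 60/40 between the first two classes on every item.
human = np.tile([0.6, 0.4, 0.0], (4, 1))
model = human.copy()  # a model that reproduces the human distribution exactly

majority = human.argmax(axis=1)        # "gold" label by majority vote
predicted = model.argmax(axis=1)
confidences = model.max(axis=1)        # top-class probability, 0.6 everywhere
correct = predicted == majority        # always True here

ece = expected_calibration_error(confidences, correct)
tvd = total_variation_distance(model, human)

print(f"ECE against majority labels: {ece:.3f}")
print(f"TVD against human distribution: {tvd:.3f}")
```

In this toy setting the model matches the annotators perfectly, so its distance to the human distribution is zero, yet ECE computed against majority-vote labels reports a gap of about 0.4, because accuracy is 1.0 while confidence is 0.6.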