Calibrate confidence scores
sfmig opened this issue · 1 comments
Is your feature request related to a problem? Please describe.
We usually interpret confidence scores as a proxy for the error in the keypoint predictions. However, it is well known that neural networks tend to be "overly confident" in their predictions. For example, for the multiclass classification case, reference [1] says:
the softmax output of modern neural networks, which typically is interpreted as a categorical distribution in classification, is poorly calibrated.
It would be very useful to be able to produce calibrated confidence scores of the keypoint predictions. That would allow us to compare results across frameworks, better filter high/low confidence values, and better interpret model performance.
Describe the solution you'd like
We could consider having a method in movement that calibrates confidence scores.
We could implement something similar to what keypoint-moseq does. They have functionality to fit a linear model to the relationship between keypoint error and confidence:
[the function] creates a widget for interactive annotation in jupyter lab. Users mark correct keypoint locations for a sequence of frames, and a regression line is fit to the log(confidence), log(error) pairs obtained through annotation. The regression coefficients are used during modeling to set a prior on the noise level for each keypoint on each frame.
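To make the idea concrete, here is a minimal sketch of that kind of fit using plain NumPy. It is not keypoint-moseq's code and not an existing movement API: the function names `fit_log_error_model` and `predict_error` are hypothetical, the data is synthetic, and I am assuming log(error) is regressed on log(confidence) with ordinary least squares.

```python
import numpy as np


def fit_log_error_model(confidence, error, eps=1e-6):
    """Fit log(error) = slope * log(confidence) + intercept by least squares.

    Hypothetical helper for illustration only. `eps` guards against log(0).
    """
    log_conf = np.log(np.asarray(confidence, dtype=float) + eps)
    log_err = np.log(np.asarray(error, dtype=float) + eps)
    # polyfit returns coefficients from highest degree to lowest
    slope, intercept = np.polyfit(log_conf, log_err, deg=1)
    return slope, intercept


def predict_error(confidence, slope, intercept, eps=1e-6):
    """Map confidence scores to an expected error, in the units of `error`."""
    log_conf = np.log(np.asarray(confidence, dtype=float) + eps)
    return np.exp(slope * log_conf + intercept)


# Synthetic "annotations" (made up): error shrinks as confidence grows
rng = np.random.default_rng(0)
confidence = rng.uniform(0.05, 1.0, size=500)
error = 2.0 * confidence ** -1.5 * np.exp(rng.normal(0.0, 0.1, size=500))

slope, intercept = fit_log_error_model(confidence, error)
expected_error = predict_error([0.2, 0.9], slope, intercept)
```

In practice the `(confidence, error)` pairs would come from a small set of manually annotated frames, and the fitted coefficients could then be used to convert any confidence score into an expected error in pixels.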
Describe alternatives you've considered
Additional context
Nice explanations for the case of classification (note that pose estimation is a regression problem, not a classification one):
- https://geoffpleiss.com/blog/nn_calibration.html
- https://scikit-learn.org/stable/modules/calibration.html
From a quick search I found:
- [1] this paper, on the calibration of human pose estimation. They propose a neural network that learns specific adjustments for a pose estimator. Seems out of scope for movement but may be a useful read to understand the problem better.
- this paper for object detection, could be similarly useful.