Behaviors of AUROC and Average Precision are inconsistent when all labels are equal
weihua916 opened this issue · 4 comments
🐛 Bug
When all labels are equal (either all zeros or all ones), the current implementations of AUROC and AveragePrecision behave quite differently.
When labels are all ones, AUROC returns 0, while AveragePrecision returns 1.
When labels are all zeros, AUROC returns 0, while AveragePrecision returns NaN.
I think it would be better to add a flag so that both metrics return NaN when all labels are equal, to better inform users.
To Reproduce
>>> from torchmetrics import AUROC, AveragePrecision
>>> import torch
>>> auroc = AUROC(task = "binary")
>>> ap = AveragePrecision(task = "binary")
>>> preds = torch.randn(10)
>>> labels = torch.ones(10, dtype = torch.long)
>>> auroc(preds, labels)
/opt/homebrew/anaconda3/lib/python3.9/site-packages/torchmetrics/utilities/prints.py:43: UserWarning: No negative samples in targets, false positive value should be meaningless. Returning zero tensor in false positive score
warnings.warn(*args, **kwargs) # noqa: B028
tensor(0.)
>>> ap(preds, labels)
tensor(1.)
>>> labels = torch.zeros(10, dtype = torch.long)
>>> auroc(preds, labels)
/opt/homebrew/anaconda3/lib/python3.9/site-packages/torchmetrics/utilities/prints.py:43: UserWarning: No positive samples in targets, true positive value should be meaningless. Returning zero tensor in true positive score
warnings.warn(*args, **kwargs) # noqa: B028
tensor(0.)
>>> ap(preds, labels)
tensor(nan)
Expected behavior
When labels are all equal, both metrics should return NaN. At the very least, there could be a flag like equal_label_mode:
>>> ap = AveragePrecision(task = "binary", equal_label_mode = "nan")
that gives the expected behavior.
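In the meantime, since no such flag exists, the all-equal case can be handled with a small wrapper around the metric call (a sketch; `safe_metric` and the NaN convention are my own, not part of torchmetrics):

```python
import torch

def safe_metric(metric_fn, preds, labels):
    # When every label is identical, both AUROC and AveragePrecision are
    # undefined, so return NaN instead of an arbitrary score.
    if labels.min() == labels.max():
        return torch.tensor(float("nan"))
    return metric_fn(preds, labels)

# e.g. safe_metric(AUROC(task="binary"), preds, labels)
```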
Environment
- TorchMetrics version: installed via pip. '1.3.0.post0'
- Python & PyTorch Version (e.g., 1.0): Python 3.9.12, torch 2.1.0
- Any other relevant information such as OS (e.g., Linux): Linux
Hi @weihua916, thanks for raising this issue.
I created PR #2507, which is intended to close this issue. The intention behind our implementations is to match sklearn fairly closely. By this I mean:
- AveragePrecision: when all labels are 1, sklearn returns a score of 1, which we also do.
- AveragePrecision: when all labels are 0, sklearn returns a score of -0.0, whereas our implementation returns NaN. That is not the intention, and PR #2507 will fix it to raise a user warning and return -0.0, matching sklearn.
- AUROC: sklearn fails completely in both cases, whether all labels are 1 or all labels are 0. We have instead chosen to raise user warnings that the scores in both cases are essentially undefined, and to return the arbitrary score of 0. The reason is that other users have requested that metrics not crash their code during training, which would also happen if the scores returned NaN. We therefore chose to go with a real, but arbitrary, score.
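For reference, sklearn's roc_auc_score does refuse to compute a score when only one class is present in the targets (a quick check, assuming a recent scikit-learn):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

preds = np.random.rand(10)
labels = np.ones(10, dtype=int)
try:
    roc_auc_score(labels, preds)
except ValueError as err:
    # sklearn raises rather than returning a score for single-class y_true
    print(err)
```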
Thank you for addressing the issue! For AUROC, I personally still believe NaN is better, since it's easy to convert NaN to 0 outside of torchmetrics. Currently, the arbitrary AUROC score of 0 may be confused with an actual score of 0.
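Converting such a NaN back to 0 downstream is indeed a one-liner with torch.nan_to_num (illustrating the commenter's point; assumes the metric had returned NaN):

```python
import torch

score = torch.tensor(float("nan"))  # an undefined AUROC/AP score
cleaned = torch.nan_to_num(score, nan=0.0)
print(cleaned)  # tensor(0.)
```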
@weihua916 I do not necessarily disagree that AUROC should return NaN rather than 0; however, we had overwhelming feedback when the metric was first introduced that this behavior was preferred.
Understood. Thanks for your consideration!