Lightning-AI/torchmetrics

Discrepancy in optimal threshold calculation between sklearn and torchmetrics ROC implementations

vitalwarley opened this issue · 2 comments

Bug description

There is a noticeable difference between the optimal thresholds calculated by sklearn.metrics.roc_curve and torchmetrics.functional.roc. Specifically, given the same similarity scores and labels, sklearn produces a significantly lower optimal threshold value than torchmetrics.

What version are you seeing the problem on?

v2.2

How to reproduce the bug

import numpy as np
import torch
from sklearn.metrics import roc_curve
import torchmetrics.functional as tm

# Given values
similarities = torch.tensor([0.0938, 0.0041, -0.1011, 0.0182, 0.0932, -0.0269, -0.0266, -0.0298,
                             -0.0200, 0.0816, -0.0122, -0.0026, 0.1237, -0.0149, 0.0840, -0.0192,
                             -0.0488, 0.0114, -0.0076, -0.0583])
is_kin_labels = torch.tensor([1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0])

# Ensure data is on CPU for sklearn compatibility
similarities_ = similarities.cpu().numpy()
is_kin_labels_ = is_kin_labels.cpu().numpy()

# Sklearn calculation
fpr_, tpr_, thresholds_ = roc_curve(is_kin_labels_, similarities_)
maxindex_ = (tpr_ - fpr_).argmax()  # Youden's J statistic: maximize tpr - fpr
best_threshold_sklearn = thresholds_[maxindex_]

# Torchmetrics calculation (accepts tensors on CPU or GPU)
fpr, tpr, thresholds = tm.roc(similarities, is_kin_labels, task='binary')
maxindex = (tpr - fpr).argmax().item()  # plain int, so it prints as "3" below
best_threshold_torchmetrics = thresholds[maxindex].item()

# Output comparison
print(f"Best threshold sklearn: {best_threshold_sklearn:.6f} @ {maxindex_} index of {len(thresholds_)} (fpr={fpr_[maxindex_]:.6f}, tpr={tpr_[maxindex_]:.6f})")
print(f"Best threshold torchmetrics: {best_threshold_torchmetrics:.6f} @ {maxindex} index of {len(thresholds)} (fpr={fpr[maxindex]:.6f}, tpr={tpr[maxindex]:.6f})")

# Best threshold sklearn: 0.093200 @ 2 index of 10 (fpr=0.000000, tpr=0.428571)
# Best threshold torchmetrics: 0.523283 @ 3 index of 21 (fpr=0.000000, tpr=0.428571)

Error messages and logs

No response

Environment

Current environment
  • CUDA:
    • GPU:
      • NVIDIA GeForce RTX 3070 Laptop GPU
    • available: True
    • version: 12.1
  • Lightning:
    • lightning: 2.2.1
    • lightning-utilities: 0.10.1
    • pytorch-lightning: 2.2.1
    • torch: 2.2.1
    • torchmetrics: 1.3.1
    • torchvision: 0.17.1
  • Packages:
    • absl-py: 2.1.0
    • aiohttp: 3.9.3
    • aiosignal: 1.3.1
    • asttokens: 2.4.1
    • attrs: 23.2.0
    • beautifulsoup4: 4.12.3
    • certifi: 2024.2.2
    • cfgv: 3.4.0
    • chardet: 5.2.0
    • charset-normalizer: 3.3.2
    • click: 8.1.7
    • contourpy: 1.2.0
    • cycler: 0.12.1
    • daemonize: 2.5.0
    • debugpy: 1.8.1
    • decorator: 5.1.1
    • distlib: 0.3.8
    • docstring-parser: 0.16
    • executing: 2.0.1
    • filelock: 3.13.1
    • fonttools: 4.50.0
    • frozenlist: 1.4.1
    • fsspec: 2023.12.2
    • gdown: 5.1.0
    • grpcio: 1.62.1
    • guildai: 0.9.0
    • identify: 2.5.35
    • idna: 3.6
    • importlib-resources: 6.3.2
    • ipython: 8.20.0
    • jedi: 0.19.1
    • jinja2: 3.1.3
    • joblib: 1.3.2
    • jsonargparse: 4.27.6
    • kiwisolver: 1.4.5
    • lightning: 2.2.1
    • lightning-utilities: 0.10.1
    • markdown: 3.6
    • markupsafe: 2.1.3
    • matplotlib: 3.8.3
    • matplotlib-inline: 0.1.6
    • mpmath: 1.3.0
    • multidict: 6.0.5
    • natsort: 8.4.0
    • networkx: 3.2.1
    • nodeenv: 1.8.0
    • numpy: 1.26.4
    • nvidia-cublas-cu12: 12.1.3.1
    • nvidia-cuda-cupti-cu12: 12.1.105
    • nvidia-cuda-nvrtc-cu12: 12.1.105
    • nvidia-cuda-runtime-cu12: 12.1.105
    • nvidia-cudnn-cu12: 8.9.2.26
    • nvidia-cufft-cu12: 11.0.2.54
    • nvidia-curand-cu12: 10.3.2.106
    • nvidia-cusolver-cu12: 11.4.5.107
    • nvidia-cusparse-cu12: 12.1.0.106
    • nvidia-nccl-cu12: 2.19.3
    • nvidia-nvjitlink-cu12: 12.3.101
    • nvidia-nvtx-cu12: 12.1.105
    • opencv-python: 4.9.0.80
    • packaging: 24.0
    • parso: 0.8.3
    • pexpect: 4.9.0
    • pillow: 10.2.0
    • pip: 24.0
    • pkginfo: 1.10.0
    • platformdirs: 4.2.0
    • pre-commit: 3.6.2
    • prompt-toolkit: 3.0.43
    • protobuf: 4.25.3
    • psutil: 5.9.8
    • ptyprocess: 0.7.0
    • pure-eval: 0.2.2
    • pygments: 2.17.2
    • pyparsing: 3.1.2
    • pysocks: 1.7.1
    • python-dateutil: 2.9.0.post0
    • pytorch-lightning: 2.2.1
    • pyyaml: 6.0.1
    • requests: 2.31.0
    • scikit-learn: 1.4.1.post1
    • scipy: 1.12.0
    • setuptools: 69.0.3
    • six: 1.16.0
    • soupsieve: 2.5
    • stack-data: 0.6.3
    • sympy: 1.12
    • tabview: 1.4.4
    • tensorboard: 2.16.2
    • tensorboard-data-server: 0.7.2
    • threadpoolctl: 3.3.0
    • torch: 2.2.1
    • torchmetrics: 1.3.1
    • torchvision: 0.17.1
    • tqdm: 4.66.2
    • traitlets: 5.14.1
    • triton: 2.2.0
    • typeshed-client: 2.5.1
    • typing-extensions: 4.9.0
    • urllib3: 2.2.1
    • virtualenv: 20.25.1
    • wcwidth: 0.2.13
    • werkzeug: 3.0.1
    • wheel: 0.42.0
    • yarl: 1.9.4
  • System:
    • OS: Linux
    • architecture:
      • 64bit
      • ELF
    • processor:
    • python: 3.11.8
    • release: 6.7.9-arch1-1
    • version: #1 SMP PREEMPT_DYNAMIC Fri, 08 Mar 2024 01:59:01 +0000

More info

The output of thresholds_ (sklearn) and thresholds (torchmetrics) reveals a significant difference in both the range and the granularity of the threshold values. (The counts also differ, 10 vs. 21, because roc_curve drops suboptimal collinear thresholds by default via drop_intermediate=True, while torchmetrics keeps one threshold per unique score plus a leading endpoint.)

[ins] In [6]: thresholds_
Out[6]: 
array([    inf,  0.1237,  0.0932,  0.0114, -0.0026, -0.0149, -0.0192,
       -0.02  , -0.0266, -0.1011], dtype=float32)

[ins] In [7]: thresholds
Out[7]: 
tensor([1.0000, 0.5309, 0.5234, 0.5233, 0.5210, 0.5204, 0.5045, 0.5028, 0.5010,
        0.4993, 0.4981, 0.4970, 0.4963, 0.4952, 0.4950, 0.4934, 0.4933, 0.4926,
        0.4878, 0.4854, 0.4747])

Hi! Thanks for your contribution, great first issue!

I think I found the problem. The returned thresholds are probabilities, because the docs state:

preds (float tensor): (N, ...). Preds should be a tensor containing probabilities or logits for each observation. If preds has values outside [0,1] range we consider the input to be logits and will auto apply sigmoid per element.

So it makes sense. My fault... However, I didn't find this behavior very clear at first.
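
A quick sanity check, a minimal sketch using only the best-threshold values printed above: applying a sigmoid to sklearn's raw-score threshold reproduces torchmetrics' probability-scale threshold, and torch.logit inverts it.

import torch

# sklearn reports thresholds on the raw similarity (logit) scale, while
# torchmetrics auto-applies a sigmoid to out-of-range preds, so its
# thresholds live on the probability scale.
raw = torch.tensor(0.0932)   # best threshold from sklearn above
prob = torch.sigmoid(raw)
print(prob)                  # tensor(0.5233) -- torchmetrics' best threshold
print(torch.logit(prob))     # tensor(0.0932) -- back to the raw scale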

thresholds: an 1d tensor of size (n_thresholds, ) with decreasing threshold values
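
For anyone hitting the same confusion: a hedged sketch of one way to make the two outputs directly comparable, reusing the similarities and is_kin_labels tensors from the reproduction above, is to invert the auto-applied sigmoid with torch.logit.

import torch
import torchmetrics.functional as tm

# Reusing `similarities` and `is_kin_labels` from the reproduction above.
fpr, tpr, thresholds = tm.roc(similarities, is_kin_labels, task='binary')

# Invert the auto-applied sigmoid to get back to the raw similarity scale.
raw_thresholds = torch.logit(thresholds)
# raw_thresholds[0] is inf (logit(1.0)), mirroring sklearn's leading inf;
# the remaining values match the unique similarity scores in decreasing order.
print(raw_thresholds)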