For temporal and sequential data (e.g. in biomedical applications), standard performance evaluation metrics, such as sensitivity and specificity, may not always be the most appropriate and can even be misleading. Evaluation metrics must ultimately reflect the needs of users and also be sufficiently sensitive to guide algorithm development.
For example, for epilepsy monitoring, neurologists ask for assessments at the level of seizure episodes (events), rather than duration-based or sample-by-sample metrics. Another performance measure with a strong practical impact in epilepsy monitoring is the false alarm rate (FAR), i.e. the number of false positives per hour or per day. Clinicians and patients see this measure as more meaningful than some established metrics in the ML community, and are very demanding in terms of performance, requiring it to be as low as possible for potential wearable applications (e.g., less than 1 FP/day). This in turn places exceptionally high requirements on precision (usually much higher than 99%).
For this reason, here we provide code that measures performance on the level of events and on a sample-by-sample basis.
In more detail, we measure performance at the level of:
- Samples: Performance metric that treats every labeled sample independently.
- Events (e.g. epileptic seizures): Classifies each event in both reference and hypothesis based on the overlap between the two.
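The difference between the two levels can be sketched in a few lines of plain Python. This is a minimal illustration and not the timescoring implementation: the helper names `sample_counts`, `overlaps` and `event_counts` are hypothetical, events are assumed to be half-open `(start, end)` intervals, and any overlap counts as a detection (no tolerances, merging or splitting).

```python
def sample_counts(ref_mask, hyp_mask):
    """Count TP, FP, FN by comparing two equal-length binary masks sample by sample."""
    pairs = list(zip(ref_mask, hyp_mask))
    tp = sum(1 for r, h in pairs if r and h)
    fp = sum(1 for r, h in pairs if not r and h)
    fn = sum(1 for r, h in pairs if r and not h)
    return tp, fp, fn


def overlaps(a, b):
    """True if the half-open intervals a and b share any time."""
    return a[0] < b[1] and b[0] < a[1]


def event_counts(ref_events, hyp_events):
    """Count TP, FP, FN per event, using 'any overlap' as the detection rule."""
    # A reference event is detected if at least one hypothesis event overlaps it.
    tp = sum(1 for r in ref_events if any(overlaps(r, h) for h in hyp_events))
    fn = len(ref_events) - tp
    # A hypothesis event is a false alarm if it overlaps no reference event.
    fp = sum(1 for h in hyp_events if not any(overlaps(h, r) for r in ref_events))
    return tp, fp, fn


# The same recording, described as masks (1 Hz) and as events (seconds).
ref_mask = [0, 1, 1, 1, 1, 0, 0, 0, 0, 0]
hyp_mask = [0, 0, 0, 1, 1, 1, 0, 0, 1, 0]
ref_events = [(1, 5)]
hyp_events = [(3, 6), (8, 9)]

print(sample_counts(ref_mask, hyp_mask))     # (2, 2, 2)
print(event_counts(ref_events, hyp_events))  # (1, 1, 0)
```

On this toy recording the sample view reports half of the seizure samples as missed, while the event view reports the seizure as detected with one false alarm, which is closer to what a clinician would count.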
Both methods are illustrated in the following figures :
The timescoring package is released for macOS, Windows and Linux on PyPI. It can be installed using pip:
python -m pip install -U pip
python -m pip install -U timescoring
The package is also available on conda-forge. It can be installed using conda:
conda install -c conda-forge timescoring
It can also be installed from source with a modern build of pip:
python -m pip install -U pip
git clone https://github.com/esl-epfl/epilepsy_performance_metrics.git
cd epilepsy_performance_metrics
python -m pip install -e .
The timescoring package provides three classes:
- annotations.Annotation: stores annotations
- scoring.SampleScoring(ref, hyp): computes sample-based scoring
- scoring.EventScoring(ref, hyp): computes event-based scoring
In addition, it provides functions to visualize the output of the scoring algorithm (see visualization.py).
Sample-based scoring allows setting the sampling frequency of the labels. It defaults to 1 Hz.
Event-based scoring allows defining certain parameters, which are provided as an instance of scoring.EventScoring.Parameters:
- toleranceStart (float): Allow some tolerance on the start of an event without counting a false detection. Defaults to 30 [seconds].
- toleranceEnd (float): Allow some tolerance on the end of an event without counting a false detection. Defaults to 60 [seconds].
- minOverlap (float): Minimum relative overlap between ref and hyp for a detection. Defaults to 0, which corresponds to any overlap [relative].
- maxEventDuration (float): Automatically split events longer than a given duration. Defaults to 5*60 [seconds].
- minDurationBetweenEvents (float): Automatically merge events that are separated by less than the given duration. Defaults to 90 [seconds].
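To make the last two parameters concrete, the following sketch shows one way merging and splitting of events could work. It is an illustration under stated assumptions, not the package's code: `merge_close_events` and `split_long_events` are hypothetical helpers, events are `(start, end)` tuples in seconds, and the exact boundary handling and order of operations inside timescoring may differ.

```python
def merge_close_events(events, min_gap):
    """Merge events whose gap to the previous event is shorter than min_gap seconds."""
    merged = []
    for start, end in sorted(events):
        if merged and start - merged[-1][1] < min_gap:
            # Extend the previous event instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def split_long_events(events, max_duration):
    """Split events longer than max_duration seconds into consecutive chunks."""
    result = []
    for start, end in events:
        while end - start > max_duration:
            result.append((start, start + max_duration))
            start += max_duration
        result.append((start, end))
    return result


# Two detections 40 s apart are merged when the minimum gap is 90 s.
print(merge_close_events([(0, 60), (100, 200), (400, 450)], 90))
# [(0, 200), (400, 450)]

# A 700 s event is split into chunks of at most 300 s.
print(split_long_events([(0, 700)], 5 * 60))
# [(0, 300), (300, 600), (600, 700)]
```

Merging keeps a fragmented detection of a single seizure from being counted several times, and splitting keeps one very long detection from covering several reference events at once.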
Scores are provided as attributes of the scoring class. The following metrics can be accessed:
- sensitivity
- precision
- f1: F1-score
- fpRate: False alarm rate per 24h
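For reference, these four metrics follow from the counts of true positives, false positives and false negatives using the standard definitions. The sketch below is an assumption about how they relate, not a copy of the package's code: `summarize` is a hypothetical helper, and how timescoring handles empty denominators is not stated here, so this version returns NaN in those cases.

```python
import math


def summarize(tp, fp, fn, duration_seconds):
    """Derive sensitivity, precision, F1 and false alarms per 24 h from raw counts."""
    nan = float("nan")
    sensitivity = tp / (tp + fn) if tp + fn > 0 else nan
    precision = tp / (tp + fp) if tp + fp > 0 else nan
    if math.isnan(sensitivity) or math.isnan(precision):
        f1 = nan
    elif sensitivity + precision == 0:
        f1 = 0.0
    else:
        f1 = 2 * sensitivity * precision / (sensitivity + precision)
    # Normalize the false positive count to a 24 h (86400 s) window.
    fp_rate = fp / (duration_seconds / 86400)
    return {"sensitivity": sensitivity, "precision": precision,
            "f1": f1, "fpRate": fp_rate}


# 8 detected events, 2 false alarms and 2 missed events over a 48 h recording.
scores = summarize(tp=8, fp=2, fn=2, duration_seconds=48 * 3600)
print("{:.2f} {:.2f} {:.2f} {:.2f}".format(
    scores["sensitivity"], scores["precision"], scores["f1"], scores["fpRate"]))
# 0.80 0.80 0.80 1.00
```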
# Loading Annotations #
from timescoring.annotations import Annotation
# Annotation objects can be instantiated from a binary mask
fs = 1
mask = [0, 1, 1, 0, 0, 0, 1, 1, 1, 0]
labels = Annotation(mask, fs)
print('Annotation objects contain a representation as a mask and as a list of events:')
print(labels.mask)
print(labels.events)
# Annotation object can also be instantiated from a list of events
fs = 1
numSamples = 10 # In this case the duration of the recording in samples should be provided
events = [(1, 3), (6, 9)]
labels = Annotation(events, fs, numSamples)
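The two representations carry the same information. As a rough sketch of the conversion, independent of the package, the hypothetical helpers below turn a binary mask into `(start, end)` events in seconds and back. The half-open convention (the end marks the first sample after the event) is an assumption, chosen because it matches the mask and event list used in the example above.

```python
def mask_to_events(mask, fs):
    """Convert a binary mask sampled at fs Hz into (start, end) events in seconds."""
    events = []
    start = None
    for i, value in enumerate(mask):
        if value and start is None:
            start = i  # rising edge: an event begins
        elif not value and start is not None:
            events.append((start / fs, i / fs))  # falling edge: the event ends
            start = None
    if start is not None:
        # The recording ends while an event is still active.
        events.append((start / fs, len(mask) / fs))
    return events


def events_to_mask(events, fs, num_samples):
    """Convert (start, end) events in seconds into a binary mask of num_samples."""
    mask = [0] * num_samples
    for start, end in events:
        for i in range(round(start * fs), min(round(end * fs), num_samples)):
            mask[i] = 1
    return mask


mask = [0, 1, 1, 0, 0, 0, 1, 1, 1, 0]
print(mask_to_events(mask, fs=1))  # [(1.0, 3.0), (6.0, 9.0)]
print(events_to_mask([(1, 3), (6, 9)], fs=1, num_samples=10) == mask)  # True
```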
# Computing performance score #
from timescoring import scoring
from timescoring import visualization
fs = 1
duration = 66 * 60
ref = Annotation([(8 * 60, 12 * 60), (30 * 60, 35 * 60), (48 * 60, 50 * 60)], fs, duration)
hyp = Annotation([(8 * 60, 12 * 60), (28 * 60, 32 * 60), (50.5 * 60, 51 * 60), (60 * 60, 62 * 60)], fs, duration)
scores = scoring.SampleScoring(ref, hyp)
figSamples = visualization.plotSampleScoring(ref, hyp)
# Scores can also be computed per event
param = scoring.EventScoring.Parameters(
    toleranceStart=30,
    toleranceEnd=60,
    minOverlap=0,
    maxEventDuration=5 * 60,
    minDurationBetweenEvents=90)
scores = scoring.EventScoring(ref, hyp, param)
figEvents = visualization.plotEventScoring(ref, hyp, param)
print("# Event scoring\n" +
      "- Sensitivity : {:.2f} \n".format(scores.sensitivity) +
      "- Precision   : {:.2f} \n".format(scores.precision) +
      "- F1-score    : {:.2f} \n".format(scores.f1) +
      "- FP/24h      : {:.2f} \n".format(scores.fpRate))
A presentation explaining these metrics is available here.