Framewise transcription evaluation
Opened this issue · 12 comments
TL;DR: Basically all I'm asking for is taking frames as inputs to @rabitt's `mir_eval.multipitch` module.
Hi everyone,
recent transcription papers report framewise evaluation metrics, e.g.:
- Kelz et al., "On the Potential of Simple Framewise Approaches to Piano Transcription". https://arxiv.org/abs/1612.05153
- Sigtia et al., "An End-to-End Neural Network for Polyphonic Piano Music Transcription"
This is basically http://scikit-learn.org/stable/modules/model_evaluation.html#multiclass-and-multilabel-classification using the `macro`/`samples` averaging parameter, but scaled by the number of frame labels.
As it seems that people use it, would it be useful to have this in `mir_eval`? If we go with the `scikit-learn` implementation, which I would strongly suggest, this adds it back as a dependency. Opinions?
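For concreteness, a minimal sketch of the kind of computation I mean, assuming predictions and targets have already been sampled onto a common frame grid as binary (frames × pitches) matrices (the array contents are made up):

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Made-up binary piano rolls: rows are frames, columns are pitches.
targ = np.array([[1, 0, 1],
                 [0, 1, 0],
                 [1, 1, 0]])
pred = np.array([[1, 0, 0],
                 [0, 1, 0],
                 [1, 0, 1]])

# Multilabel evaluation as in the sklearn docs linked above:
# 'samples' averages the per-frame scores over frames.
p, r, f, _ = precision_recall_fscore_support(targ, pred, average='samples')
print(p, r, f)
```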
Thanks for bringing this up. A few questions:
- Is there a reference implementation for these, e.g. something we can compare the scores generated by a hypothetical `mir_eval` implementation to?
- Is this metric widely accepted (e.g. community consensus), i.e. will it be used in MIREX?
- Is `scikit-learn` in a better place in terms of easy installability (e.g. thanks to wheels, or whatever)? The reason we made an effort not to include it as a dependency was that installation was non-trivial, e.g. IIRC it was not straightforward to get it installed on the Heroku instance which runs `mir_eval` as a service. It also made it impossible to create binaries via `pyinstaller`, but we gave up on that (#65).
fwiw, in sound event detection (SED) there seems to be a growing preference for frame-(or "segment", which is basically some fixed time duration)-based evaluation over event-based evaluation (which is equivalent to note-based), because the latter is very penalizing (consider the case where an algorithm returns two consecutive notes for a single reference note - the first note would only be a match if you ignore offsets and the second would always be treated as wrong, even though both match the reference in pitch and time if you ignore the split). So regardless of what the trend in MIREX is (I'm abroad and can't seem to load the mirex website right now), I expect we'll see frame-level metrics used more and more in transcription papers.
In this context I should mention that, precisely because of this issue, there was an interesting attempt by Molina et al. at introducing more note-based transcription metrics in order to provide greater insight into system performance, though it was focused on singing transcription and I'm not sure whether it has been adopted by the community.
With regards to @craffel's questions:
- Wouldn't sklearn itself be a reference implementation, given that frame-level metrics are kinda domain agnostic (every pitch in every frame is either right/wrong, and from that you compute the F-score as you would for any IR problem)? Hmm, as I write this, I guess you do have to define what a "hit" means, especially when comparing to annotations with a pitch resolution finer than semitones. I wonder whether there's a consensus there? (e.g. in melody extraction the pitch distance must be within 50 cents for a hit; see the sketch after this list)
- Bad internet = can't check the mirex site. But my hunch is that these metrics will become increasingly popular.
- Dunno.
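To make the "hit" question concrete, here is a tiny sketch of the 50-cent rule from melody eval applied to a single frame's pitch pair (made-up values; whether the multipitch metrics should use exactly this rule is precisely the open question):

```python
import numpy as np

def is_hit(ref_hz, est_hz, tolerance_cents=50.0):
    # Pitch distance in cents between an estimated and a reference frequency.
    cents = 1200.0 * np.log2(est_hz / ref_hz)
    return abs(cents) <= tolerance_cents

print(is_hit(440.0, 446.0))   # ~23 cents sharp -> True (hit)
print(is_hit(440.0, 466.16))  # ~100 cents sharp -> False (miss)
```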
> Bad internet = can't check the mirex site. But my hunch is that these metrics will become increasingly popular.
No, it seems to be down...
One point of reference is:
M. Bay, A. F. Ehmann, and J. S. Downie, “Evaluation of multiple-F0 estimation and tracking systems,” in Proceedings of the 10th International Society for Music Information Retrieval Conference (ISMIR), 2009, pp. 315–320.
Relating to sklearn's function, there is still confusion about which aggregation function to use (`micro` or `samples`).
Pinging @sidsig, @fdlm, and @emmanouilb: Maybe you can enlighten us here a little bit?
> One point of reference is:
> M. Bay, A. F. Ehmann, and J. S. Downie, “Evaluation of multiple-F0 estimation and tracking systems,” in Proceedings of the 10th International Society for Music Information Retrieval Conference (ISMIR), 2009, pp. 315–320.
That's a multi-f0 tracking paper though, not transcription. It's important not to confound these, as transcription is a more involved task requiring segmentation/quantization into discrete note events.
Perhaps I should add a qualification to my SED analogy - for SED I think it can be less important to focus on discrete events (depending on the source!) and rather consider presence/absence over time. However, in music discrete notes are very much a thing, and music notation is a well established paradigm (as is piano-roll), so I'd be reluctant to abandon note-based eval for transcription altogether.
I think the most complete option is to compute both frame and note-level metrics, as done by Sigtia et al., so it would be nice to support that.
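For the note-level half, a rough sketch of what I have in mind (made-up note lists; I'm assuming the `mir_eval.transcription` module exposes the usual `evaluate()` entry point, so treat the exact call as illustrative):

```python
import numpy as np
import mir_eval

# Made-up note lists: intervals are (onset, offset) in seconds, pitches in Hz.
ref_intervals = np.array([[0.0, 1.0], [1.0, 2.0]])
ref_pitches = np.array([440.0, 493.88])
est_intervals = np.array([[0.02, 0.95], [1.05, 2.10]])
est_pitches = np.array([440.0, 493.88])

# Note-level scores (onset/pitch/offset matching) from the existing module.
note_scores = mir_eval.transcription.evaluate(
    ref_intervals, ref_pitches, est_intervals, est_pitches)
print(note_scores)
```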
> That's a multi-f0 tracking paper though, not transcription. It's important not to confound these, as transcription is a more involved task requiring segmentation/quantization into discrete note events.
Right! But I think frame-wise evaluation would work the same for both multi-f0 tracking and 'real' transcription. I don't know if it makes sense for note transcription, though.
Anyways, following formulas 1, 2, and 3 in Bay et al., and assuming we sampled predictions and targets at a specified frame rate, we get two bit vectors `pred` and `targ`. Then computing true positives, false positives, false negatives, precision, and recall is just:
```python
# pred and targ are boolean numpy arrays sampled on the same frame grid
tp = float((pred & targ).sum())   # true positives
fp = float((pred & ~targ).sum())  # false positives
fn = float((targ & ~pred).sum())  # false negatives
p = tp / (tp + fp)
r = tp / (tp + fn)
f1 = 2 * p * r / (p + r)
```
This corresponds to the `micro` setting in sklearn.
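As a sanity check, the same numbers should come out of sklearn's micro average, provided `pred` and `targ` are binary (frames × pitches) indicator matrices rather than flat vectors (a sketch under that assumption):

```python
from sklearn.metrics import precision_recall_fscore_support

# Assuming pred and targ are binary (n_frames, n_pitches) indicator matrices:
# micro averaging pools TP/FP/FN over every (frame, pitch) cell, which is the
# same pooled computation as above.
p_sk, r_sk, f_sk, _ = precision_recall_fscore_support(
    targ.astype(int), pred.astype(int), average='micro')
print(p_sk, r_sk, f_sk)
```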
> I think the most complete option is to compute both frame and note-level metrics, as done by Sigtia et al., so it would be nice to support that.
Definitely.
> That's a multi-f0 tracking paper though, not transcription. It's important not to confound these, as transcription is a more involved task requiring segmentation/quantization into discrete note events.
I just tracked down the literature given by Sigtia et al. and I agree, defining the task is very important here. For me, music transcription involves the step of aggregating framewise candidates into a sequence of notes (which then matches the MIDI-like list of note events used by the transcription metrics).
So what would you then call the step of having only the framewise candidates? Multi-f0 tracking?
actually, the formulas in the two papers linked by stefan are very likely the wrong ones, ... i wrote up the whole ugly mess here:
evaluation_shenanigans.pdf
TL;DR:
when i finished the paper, i just copied the formulas over at the last minute, w/out double-checking -- in sigtia's paper they actually reference the paper that defines the measures as in the 'micro' setting in sklearn (bay et al. 2009), but write the (unnormalized, kind of non-sensical in this form) formulas for the 'samples' setting.
the actual evaluation used in our paper is equivalent to the 'micro' setting in sklearn.
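To make the difference concrete, here is a small numpy sketch (made-up piano-roll matrices) of the two aggregations: 'micro' pools the counts over all frames and pitches, while 'samples' computes a score per frame and then averages over frames:

```python
import numpy as np

targ = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]], dtype=bool)  # frames x pitches
pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 1]], dtype=bool)

# 'micro' (as in Bay et al. 2009): pool counts over every (frame, pitch) cell.
tp = (pred & targ).sum()
p_micro = tp / pred.sum()
r_micro = tp / targ.sum()
f_micro = 2 * p_micro * r_micro / (p_micro + r_micro)

# 'samples': compute precision/recall/F per frame, then average over frames.
# (Every frame here has at least one active pitch, so no zero divisions.)
tp_t = (pred & targ).sum(axis=1)
p_t = tp_t / pred.sum(axis=1)
r_t = tp_t / targ.sum(axis=1)
f_samples = (2 * p_t * r_t / (p_t + r_t)).mean()

print(f_micro, f_samples)  # the two aggregations give different numbers
```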
Btw, note-level eval is already implemented in `mir_eval`, as is multi-f0. So (assuming the implementations are correct) to get frame-level metrics for transcription you'd just have to sample the note events onto a fixed time grid (which I think is also already implemented somewhere) and then feed that into the multi-f0 metrics, as @stefan-balke noted in the first comment. Pinging @rabitt.
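Roughly, a sketch of that pipeline (the `notes_to_frames` gridding helper is hypothetical, and I'm assuming `mir_eval.multipitch.evaluate` accepts per-frame times plus a list of arrays of active frequencies, so double-check against the actual API):

```python
import numpy as np
import mir_eval

def notes_to_frames(intervals, pitches_hz, hop=0.01):
    # Hypothetical helper: sample note events (onset/offset intervals in
    # seconds, pitches in Hz) onto a fixed time grid.  Returns frame times
    # and, per frame, an array of the frequencies active at that time.
    t_max = intervals[:, 1].max()  # assumes at least one note
    times = np.arange(0.0, t_max + hop, hop)
    freqs = [pitches_hz[(intervals[:, 0] <= t) & (t < intervals[:, 1])]
             for t in times]
    return times, freqs

# Made-up reference and estimated note lists.
ref_int = np.array([[0.0, 1.0], [0.5, 1.5]])
ref_hz = np.array([440.0, 660.0])
est_int = np.array([[0.0, 0.9], [0.55, 1.5]])
est_hz = np.array([440.0, 660.0])

ref_times, ref_freqs = notes_to_frames(ref_int, ref_hz)
est_times, est_freqs = notes_to_frames(est_int, est_hz)
frame_scores = mir_eval.multipitch.evaluate(ref_times, ref_freqs, est_times, est_freqs)
print(frame_scores)
```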
> to get frame-level metrics for transcription you'd just have to sample the note events onto a fixed time grid (which I think is also already implemented somewhere) and then feed that into the multi-f0 metrics
Based on my understanding of the metrics being discussed here, this seems correct. What functionality is currently missing?
Ping. If there is any functionality missing, please make it clear; otherwise, I will close.
Last I heard we'd reached agreement on how this should be implemented, but I assumed @stefan-balke was the one who was actually going to do it?
Yep, on my list. @justinsalamon, see you at ICASSP then :)