Use mir_eval.separation.evaluate with speech and noise signals.

Question

Opened this issue 7 years ago · 2 comments

I would like to use mir_eval.separation.evaluate to evaluate the separation performance (SDR, SIR, SNR) of a separation system in the presence of noise.

We may assume the following:

x_1: Clean speech, speaker 1
x_2: Clean speech, speaker 2
n: Additive noise

Now we may have a system S, which estimates three (possibly permuted) enhanced signals:
z_1: Enhanced signal 1
z_2: Enhanced signal 2
z_3: Enhanced signal 3

How would I use the function mir_eval.separation.evaluate to evaluate the result? It currently only allows K reference signals and K estimated signals, and has no additional input for noise signals.

If we find a good solution, we may add it to the docs later.

Answer 1 · 2017-07-26T11:01:02.000Z

I am not sure I understand your setup. In particular, what would the three estimates represent? Can you explain this in more detail, please?
Also, in speech enhancement people more commonly use perceptual quality measures such as PESQ, so I am not sure bsseval is the best fit here.

Answer 2 · 2017-07-26T11:16:32.000Z

I think there are several things to say on this matter, and two main routes for dealing with the issue.

A/ having noise image

First of all, I strongly advise NEVER to use bsseval_sources, because that evaluation actually changes the reference as a function of the estimate to compute the score. It is very easy to devise trivial estimates that always get extremely high scores with this method.
Then, to answer the question (now assuming you want to use bsseval_images): I think there is an issue only because the problem is ill-posed. You simply need a reference for the noise, given that your observed mixture is the sum of the image of speaker 1, the image of speaker 2, and the true noise. If you had this noise reference, you would have three references and three estimates, as required.

=> To me, the best practice is hence to always know the true noise, as well as the true "target" source images, for evaluation.

If this is not the case and you only know target sources, and not the actual images along with the true noise, this raises the related interesting question:
=> how to estimate the image of those two sources within the mix, as well as a further image for the noise?

This question is probably ill-posed, because it requires prior assumptions about what the noise should look like. If we were to implement the computation of the target images and the noise inside the evaluation function, we would be making such an assumption arbitrarily. That said, since these computations DO NOT use the estimates, but only the references and the mix, they would not suffer from the flaw of bsseval_sources, which exploits the estimates to compute the references. However, I don't see any particular consensus on how to compute these "ground-truth images" from the ground-truth sources. In any case, this would be a separate module with no particular connection to mir_eval.

B/ trying all combinations

A simple solution could be to try all the possible combinations, and to discard the estimate that gives the worst performance as being the noise source. This would allow passing one extra estimate, at the cost of more computation.
Still, doing this actually DOES NOT totally solve the problem I see with your setup, because it probably means using bsseval_sources, which I again strongly advise you not to do.
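The search in route B can be sketched as follows. This is an illustration only: `plain_sdr` and `best_assignment` are hypothetical helpers, not part of mir_eval, and the score is a plain sample-wise SDR used as a stand-in, not a BSS Eval metric. Any other per-pair score could be passed in instead.

```python
import itertools
import numpy as np


def plain_sdr(reference, estimate):
    """Stand-in score: 10*log10(||s||^2 / ||s - s_hat||^2). Not BSS Eval."""
    error = reference - estimate
    return 10 * np.log10(np.sum(reference**2) / (np.sum(error**2) + 1e-12))


def best_assignment(references, estimates, score=plain_sdr):
    """Try every way of matching K references to K of the estimates.

    Returns the best assignment (assignment[j] is the estimate index for
    reference j), the indices of the discarded estimates, and the mean score.
    """
    n_ref = len(references)
    best_score, best_perm = -np.inf, None
    for perm in itertools.permutations(range(len(estimates)), n_ref):
        mean_score = np.mean(
            [score(references[j], estimates[i]) for j, i in enumerate(perm)]
        )
        if mean_score > best_score:
            best_score, best_perm = mean_score, perm
    discarded = sorted(set(range(len(estimates))) - set(best_perm))
    return best_perm, discarded, best_score


rng = np.random.default_rng(0)
x_1 = rng.standard_normal(2000)
x_2 = rng.standard_normal(2000)
references = np.stack([x_1, x_2])

# Hypothetical system output: one noise-like signal and two good estimates.
z = np.stack([
    0.1 * rng.standard_normal(2000),
    x_2 + 0.01 * rng.standard_normal(2000),
    x_1 + 0.01 * rng.standard_normal(2000),
])

assignment, discarded, mean_score = best_assignment(references, z)
print("assignment:", assignment, "discarded as noise:", discarded)
```

With K references and K+1 estimates there are (K+1)! candidate assignments, so the exhaustive search is only practical for a small number of sources.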