How should the results be interpreted?
vladfulgeanu opened this issue · 3 comments
Hello!
I tried to use the python implementation to detect voice for the first 100s from this video:
https://www.youtube.com/watch?v=gYdHyeo0eec
And these are the results on the spectrogram:
First of all, why are there positive results during the first 47 seconds? Is it just the model not being trained to disregard music?
And secondly, is there a way to merge adjacent detections whenever voice is found, so that the output isn't a series of fraction-of-a-second intervals one after another?
Thanks very much in advance!
Actually, the training set of our VAD doesn't contain music, so yes. The result you uploaded can be improved if some post-processing is applied, but we only have a MATLAB version of the post-processing. The MATLAB script is quite simple, so you can easily port it to Python (if you are comfortable with NumPy). If you need it, I will share the post-processing script with you.
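Since the original post-processing script is MATLAB-only and not shown here, the following is just a minimal NumPy sketch of one common smoothing approach (a hypothetical implementation, not the repository's script): bridge short non-speech gaps between detections and drop segments shorter than a minimum duration. The frame duration and thresholds are assumptions you would tune for your application.

```python
import numpy as np

def merge_segments(frame_labels, frame_dur=0.01, min_gap=0.3, min_len=0.2):
    """Turn frame-level 0/1 VAD decisions into (start, end) intervals in seconds.

    Gaps between speech runs shorter than `min_gap` seconds are bridged;
    merged segments shorter than `min_len` seconds are discarded.
    All parameter values are illustrative defaults, not the project's.
    """
    labels = np.asarray(frame_labels).astype(bool)
    # Pad with False on both sides so every speech run has a rising and
    # falling edge; edge positions are then frame indices of run boundaries.
    padded = np.concatenate(([False], labels, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    starts = edges[::2] * frame_dur   # rising edges -> segment starts
    ends = edges[1::2] * frame_dur    # falling edges -> segment ends (exclusive)

    merged = []
    for s, e in zip(starts, ends):
        if merged and s - merged[-1][1] < min_gap:
            merged[-1][1] = e         # bridge the short non-speech gap
        else:
            merged.append([s, e])
    return [(s, e) for s, e in merged if e - s >= min_len]
```

For example, with 100 ms frames, `merge_segments([0,0,1,1,0,1,1,1,0,0], frame_dur=0.1, min_gap=0.15, min_len=0.2)` joins the two short runs into a single segment from 0.2 s to 0.8 s, instead of reporting two sub-second fragments.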
@jtkim-kaist Is there an estimated date for a full python implementation (training & testing - with post processing)?
@vladfulgeanu
We uploaded the end-point detection (EPD) algorithm to https://github.com/jtkim-kaist/end-point-detection
(it may be what you want, because EPD finds the start and end points of a speech signal).
However, the VAD used in the EPD project is based on a shallow CNN, so its performance may be worse than this project's. The hyperparameters may also need to be changed for your application.
Joining this project with EPD will be done someday, but it is hard to give an exact date.