How should the results be interpreted?
vladfulgeanu opened this issue · 3 comments
Hello!
I tried to use the python implementation to detect voice for the first 100s from this video:
https://www.youtube.com/watch?v=gYdHyeo0eec
And these are the results on the spectrogram:
First of all, why are there positive results during the first 47 seconds? Is it just the model not being trained to disregard music?
And secondly, is there a way to merge adjacent detections whenever voice is found, so that the output isn't a series of fraction-of-a-second intervals one after another?
Thanks very much in advance!
Actually, the training set of our VAD doesn't contain music, so yes. The result you uploaded can be improved if some post-processing is applied, but we only have a MATLAB version of the post-processing. The MATLAB script is quite simple, so you can easily port it to Python (if you are comfortable with NumPy). If you need it, I will share the post-processing script with you.
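Since the original post-processing script is MATLAB-only and not shown here, the following is just a minimal NumPy sketch of one common smoothing approach (a hypothetical implementation, not the repository's script): bridge short non-speech gaps between detections and drop segments shorter than a minimum duration. The frame duration and thresholds are assumptions you would tune for your application.

```python
import numpy as np

def merge_segments(frame_labels, frame_dur=0.01, min_gap=0.3, min_len=0.2):
    """Turn frame-level 0/1 VAD decisions into (start, end) intervals in seconds.

    Gaps between speech runs shorter than `min_gap` seconds are bridged;
    merged segments shorter than `min_len` seconds are discarded.
    All parameter values are illustrative defaults, not the project's.
    """
    labels = np.asarray(frame_labels).astype(bool)
    # Pad with False on both sides so every speech run has a rising and
    # falling edge; edge positions are then frame indices of run boundaries.
    padded = np.concatenate(([False], labels, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    starts = edges[::2] * frame_dur   # rising edges -> segment starts
    ends = edges[1::2] * frame_dur    # falling edges -> segment ends (exclusive)

    merged = []
    for s, e in zip(starts, ends):
        if merged and s - merged[-1][1] < min_gap:
            merged[-1][1] = e         # bridge the short non-speech gap
        else:
            merged.append([s, e])
    return [(s, e) for s, e in merged if e - s >= min_len]
```

For example, with 100 ms frames, `merge_segments([0,0,1,1,0,1,1,1,0,0], frame_dur=0.1, min_gap=0.15, min_len=0.2)` joins the two short runs into a single segment from 0.2 s to 0.8 s, instead of reporting two sub-second fragments.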
@jtkim-kaist Is there an estimated date for a full python implementation (training & testing - with post processing)?
@vladfulgeanu
We uploaded the end-point detection (EPD) algorithm to https://github.com/jtkim-kaist/end-point-detection
(it may be what you want, because EPD finds the start and end points of a speech signal).
However, the VAD used in the EPD project is based on a shallow CNN, so its performance may be worse than this project's. The hyperparameters may also need to be changed for your application.
Joining this project with EPD will be done someday, but it is hard to give an exact date.