hcmlab/vadnet

Use as Music Detector

anavc94 opened this issue · 4 comments

Hello,

First of all, congratulations on this project. The results I got using this library really surprised me. Thanks!

I was wondering if it would be easy to detect where music segments appear in the "noise file", because I realized that radio or TV tunes are always classified as noise (as they should be), but I am also interested in extracting those parts.

Any suggestions would be appreciated! Thanks!

It should be pretty easy to find sources of pure music on the web, so the trickier part will be to find sources that include anything BUT music (or at least only a marginal amount of it). News or talk shows might be worth a shot. Another possibility could be radio, as long as you know when music is being played (maybe there's a trigger similar to the subtitles that we use to label speech in movies).

Once you have a reliable source, you will have to implement a class that derives from SourceBase (see source\base.py). The function next is supposed to return a matrix of size number_of_frames x frame_size and a vector of size number_of_frames, which assigns a label id to each frame, e.g. 0 = noise and 1 = music (the label names should be returned by get_targets). Have a look at audio_vad_files.py and you'll quickly understand the mechanism; there is also a sketch below.

Finally, you will have to replace the --source parameter in do_train.cmd with your new class, e.g. --source source.music.MyMusicSource. Hope that helps.
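To make the shape of this more concrete, here is a minimal sketch of such a source class. It only illustrates the contract described above (next returning a frames matrix plus a label vector, get_targets returning the label names); the constructor-free layout, the dummy data, and the assumption that next takes no arguments are all hypothetical, so check source\base.py and audio_vad_files.py for the real signatures before copying anything:

```python
# Hypothetical sketch of a noise-vs-music source (e.g. source/music.py).
# Assumes the SourceBase interface described in this thread; see
# source/base.py for the actual contract.

import numpy as np

from source.base import SourceBase


class MyMusicSource(SourceBase):
    """Yields audio frames labelled 0 = noise, 1 = music."""

    def get_targets(self):
        # Position in this list is the label id used in next().
        return ['noise', 'music']

    def next(self):
        # A real implementation would read audio from your data source here.
        # For illustration we fabricate 100 random frames of 48000 samples.
        number_of_frames, frame_size = 100, 48000
        frames = np.random.rand(number_of_frames, frame_size).astype(np.float32)

        # One label id per frame: first half noise (0), second half music (1).
        labels = np.zeros(number_of_frames, dtype=np.int32)
        labels[number_of_frames // 2:] = 1

        return frames, labels
```

With a file like that in place, training would then be pointed at it via --source source.music.MyMusicSource in do_train.cmd (the module path depends on where you put the file).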

Hello @frankenjoe,

Thanks for the quick response and all the explanations; I will try it.
I'm a beginner with this kind of algorithm, so I really appreciate the help.

Have a nice day!

Hello, I am interested in this thread. Apart from the challenge of finding data, is it possible to make vadnet detect voice, music, and 'other' simultaneously? I mean, having three different labels. Thanks!

It is possible to train on several classes at once by using indices 0..n_classes-1. In your case, when implementing SourceBase, let get_targets return the three names (e.g. voice, music, other) and, in the annotation that is returned along with the audio frames, use 0, 1, and 2 to represent these labels. Alternatively, use two binary classifiers: voice vs. other and music vs. other. The latter has the advantage that voice and music can be detected at the same time, which is usually the case when there is a singing voice.
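For the three-class route, the only change compared to the two-class sketch above is the target list and the range of label ids. Again a hypothetical illustration, with made-up data and an assumed no-argument next; the real interface is defined in source/base.py:

```python
# Hypothetical three-class variant: only the label names and ids change.

import numpy as np

from source.base import SourceBase


class MyThreeClassSource(SourceBase):
    """Yields audio frames labelled 0 = voice, 1 = music, 2 = other."""

    def get_targets(self):
        # Position in this list is the label id used in next().
        return ['voice', 'music', 'other']

    def next(self):
        # Dummy data for illustration: 90 random frames, a third per class.
        number_of_frames, frame_size = 90, 48000
        frames = np.random.rand(number_of_frames, frame_size).astype(np.float32)

        # Label ids 0, 1, 2 assigned in equal blocks.
        labels = np.repeat([0, 1, 2], number_of_frames // 3).astype(np.int32)

        return frames, labels
```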