Simple audio segmenter to isolate speech portions from audio streams. It uses a simple feedforward MLP for classification (implemented using `tensorflow`) and heuristic smoothing methods to increase the recall of speech segments. This version is modified from the brandeis-llc repository to use applause, speech, music, noise, and silence as possible labels, and to handle binary classification of applause (rather than speech).
- System packages:
  - `ffmpeg`
- Python packages:
  - `librosa`
  - `tensorflow` or `tensorflow-gpu` (`>=2.0.0`)
  - `numpy`
  - `scipy`
  - `scikit-learn`
  - `ffmpeg-python`
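For example, the Python dependencies can be installed with pip (substitute `tensorflow-gpu` if you want GPU support; `ffmpeg` itself comes from your system package manager):

```sh
pip install librosa "tensorflow>=2.0.0" numpy scipy scikit-learn ffmpeg-python
```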
We provide two pretrained models. Both are trained on 3-second clips from the MUSAN corpus, HIPSTAS applause samples, and sound from Indiana University collections, using the labels `applause`, `speech`, `music`, `noise`, and `silence`. The models are then serialized using the TensorFlow `SavedModel` format. The `applause-binary-xxxxxxxx` model is trained to predict applause vs. non-applause; the `non-binary-xxxxxxxx` model uses all the labels above. Because of the distribution bias in the corpus (far fewer noise and silence samples in the training data), we randomly upsampled the minority classes.
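For illustration only (this is not the repository's training code), random upsampling with replacement can be done with `sklearn.utils.resample`; the array names and sizes here are made up:

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical feature arrays; names and shapes are illustrative only.
X_minority = np.random.rand(50, 40)   # e.g., 50 "silence" feature vectors
X_majority = np.random.rand(500, 40)  # e.g., 500 "speech" feature vectors

# Randomly resample the minority class (with replacement)
# up to the size of the majority class.
X_minority_up = resample(X_minority, replace=True,
                         n_samples=len(X_majority), random_state=42)

X_balanced = np.vstack([X_majority, X_minority_up])
```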
To train your own model, invoke `run.py` with the `-t` flag and pass the name of the directory where the training data is stored. Each file in your training set should have its label at the start of the file name, followed by a `-`; for example, `applause-mysound124.wav` (see the `extract_all` function in `feature.py`).
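A typical training invocation might look like this (the path is a placeholder, and this assumes `-t` takes the directory as its argument):

```sh
python run.py -t /path/to/training/data
```

The label-prefix convention can be pictured with this sketch (illustrative, not the repository's `extract_all`):

```python
from pathlib import Path

# The label is everything before the first "-" in the file name.
def label_from_filename(path):
    return Path(path).name.split('-', 1)[0]

print(label_from_filename('applause-mysound124.wav'))  # -> 'applause'
```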
To run the segmenter over audio files, invoke `run.py` with the `-s` flag and pass 1) the model path (feel free to use a pretrained model if needed) and 2) the directory where the audio files are stored. Currently it will process all `mp3` and `wav` files in the target directory. If you want to process other types of audio file, add to or change the `file_ext` list near the bottom of `run.py`.
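For instance, assuming the list holds bare extensions (the actual contents of `run.py` may differ), adding FLAC support might look like:

```python
# Near the bottom of run.py: extensions of audio files to process.
file_ext = ['mp3', 'wav', 'flac']  # 'flac' added here as an illustration
```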
If you want to use binary classification, include the `-b` flag.
If you want to specify a minimum segment length, use the `-T` flag with a number of milliseconds. Shorter segments will be merged with the previous one (short segments at the beginning will be omitted).
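The merging rule can be pictured with this sketch (illustrative only, not the repository's actual code):

```python
# Segments shorter than min_len_ms are merged into the previous segment;
# a short segment at the very beginning has no predecessor and is dropped.
def merge_short_segments(segments, min_len_ms):
    merged = []
    for seg in segments:  # seg = {'label': ..., 'start': s, 'end': e}, times in seconds
        dur_ms = (seg['end'] - seg['start']) * 1000
        if dur_ms < min_len_ms:
            if merged:
                merged[-1]['end'] = seg['end']  # absorb into the previous segment
            # else: short leading segment is omitted
        else:
            merged.append(dict(seg))
    return merged
```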
For example:

```sh
python run.py -s /path/to/pretrained/applause-binary-20210203 /path/to/audio -o /path/to/output/folder -T 1000 -b
```
The processed results are stored as a JSON file in the target directory, named after the audio input. The JSON contains a list of segments, each with a label and start and end times in seconds. For example:
```json
[
  {
    "label": "non-applause",
    "start": 0.0,
    "end": 0.64
  },
  {
    "label": "applause",
    "start": 0.65,
    "end": 6.78
  },
  {
    "label": "non-applause",
    "start": 6.79,
    "end": 373.83
  },
  {
    "label": "applause",
    "start": 373.84,
    "end": 379.55
  }
]
```
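As a quick consumer of this format (the file name here is hypothetical), the standard `json` module is enough:

```python
import json

# Load a result file and print the applause spans.
with open('mysound124.wav.json') as f:
    segments = json.load(f)

for seg in segments:
    if seg['label'] == 'applause':
        print(f"applause from {seg['start']:.2f}s to {seg['end']:.2f}s")
```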
To build as root:

```sh
sudo singularity build applause-detection.sif applause-detection.def
```

To build as non-admin:

```sh
singularity build --fakeroot applause-detection.sif applause-detection.def
```

To run:

```sh
./applause-detection.sif <input_output_tmp_dir> <min_segment_duration>
```
The input audio file should be copied to the temporary directory specified above; the output file will be generated in the same location, with a `.json` extension appended to the input filename.
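Putting the container workflow together (the paths, file name, and the 1000 ms value below are placeholders):

```sh
# Stage the input, run the container, then inspect the result.
mkdir -p /tmp/applause
cp mysound124.wav /tmp/applause/
./applause-detection.sif /tmp/applause 1000
cat /tmp/applause/mysound124.wav.json
```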