This project aims to automatically align textual lyrics with monophonic singing vocals (audio). Such a system is useful in settings where a karaoke performer wants to stay in sync with the background track. Traditional Hidden Markov Models are used for phoneme modelling, and a structural segmentation approach is explored to break the audio (usually 4-5 minutes long) into smaller, structurally meaningful chunks (intro, verse, chorus, etc.) without any implicit assumptions.
- [HTK toolkit](http://htk.eng.cam.ac.uk/download.shtml)
- [sph2pipe](https://www.ldc.upenn.edu/language-resources/tools/sphere-conversion-tools)
- [Flite](http://www.speech.cs.cmu.edu/flite/download.html)
- [MSAF](https://github.com/urinieto/msaf/releases)
- Create initial HMM models (isolated phoneme training):
  `tcsh scripts/model_gen.sh <phonelist> <proto_file>`
- Create connected HMM models (embedded re-estimation):
  `tcsh script/embedded_reestimation.sh <iterations>`
- Align the DAMP dataset with the generated HMM models using forced Viterbi alignment
- Perform embedded re-estimation on the DAMP dataset to refine the phoneme models
- Use the MSAF library to split the DAMP training data into structural segments:
  `python scripts/msaf_segmentation.py <wav_in_dir> <wav_out_dir>`
- Create MLF files corresponding to the segmented audio:
  `python scripts/msaf_to_mlf.py <labfile_list>`
- Perform embedded re-estimation within these segments to obtain the final phoneme models
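
For reference, an HTK master label file (MLF) has a simple text layout: a `#!MLF!#` header, then for each utterance a quoted label-file pattern, one label per line (optionally preceded by start and end times in units of 100 ns), and a closing `.`. The sketch below illustrates that layout in Python. It is not the contents of `scripts/msaf_to_mlf.py`; the function name, segment name, and phone labels are hypothetical.

```python
def format_mlf(utterances):
    """Render {utterance_name: [(start_s, end_s, label), ...]} as HTK MLF text."""
    lines = ["#!MLF!#"]
    for name, labels in utterances.items():
        # The "*/" prefix lets HTK match the label file regardless of directory.
        lines.append(f'"*/{name}.lab"')
        for start_s, end_s, label in labels:
            # HTK label times are integers in units of 100 ns.
            start = int(round(start_s * 1e7))
            end = int(round(end_s * 1e7))
            lines.append(f"{start} {end} {label}")
        # A single period terminates each utterance's label list.
        lines.append(".")
    return "\n".join(lines) + "\n"


# Hypothetical segment name and phone labels, for illustration only.
example = format_mlf({"song01_seg00": [(0.0, 0.35, "sil"), (0.35, 0.5, "hh")]})
print(example)
```

Keeping one MLF entry per structural segment means each segment is treated as its own utterance during re-estimation.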
- To test a model, first run forced Viterbi alignment:
  `sh scripts/force_align.sh`
  Set parameters such as the model, features, MLF, and dictionary inside the script.
- To evaluate a model, compare its output against the manually annotated ground truth and compute the overlap:
  `python scripts/lab_to_lrc.py <lyrics_list>`
  Set the ground-truth and output folders inside the script.
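
One way to compute the overlap is to measure, per lyric unit, how much of the ground-truth interval is covered by the aligned interval. The sketch below is an assumption about the metric, not the logic in `scripts/lab_to_lrc.py`; the function names and the toy intervals are hypothetical.

```python
def interval_overlap(a, b):
    """Length in seconds of the intersection of two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def overlap_ratio(reference, predicted):
    """Fraction of the total reference duration covered by the predictions.

    Both lists hold (start, end) tuples in seconds, one per lyric unit,
    in the same order.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    total = sum(end - start for start, end in reference)
    if total <= 0:
        return 0.0
    covered = sum(interval_overlap(r, p) for r, p in zip(reference, predicted))
    return covered / total


# Toy intervals, for illustration only.
reference = [(0.0, 2.0), (2.0, 4.0)]
predicted = [(0.5, 2.0), (2.0, 5.0)]
ratio = overlap_ratio(reference, predicted)
print(f"{ratio:.3f}")
```

A ratio of 1.0 means every ground-truth interval is fully covered; this measure does not penalise predictions that extend past the reference, so a stricter variant could divide by the union of the two intervals instead.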
- Phoneme Acoustic Modelling - Rupak Vignesh
- Structural Segmentation with MSAF - Benjamin Genchel
- Thanks to Alex Lerch for his guidance
- S Aswin Shanmugham's hybrid segmentation framework
- Stanford's DAMP dataset.