iranroman/musicinformationretrieval.com

Aligning audio using DTW?

monamalhotra opened this issue · 2 comments

Hey,

Great job on the examples, really helpful. I am trying to align audio sequences of the same sentence spoken by different speakers (different dialects and speed of speaking). I'm following your DTW example with different audio speeds.

As I am really new to audio processing, can you please guide me as to which features would work better in this case: MFCC, Chroma, or STFT?

I want to warp/align my target audio to the reference audio after computing the DTW path. How do I use the path to speed up my slow audio to the fast one or vice versa? Can this be done using librosa as well? Any example on how to save the final aligned audio?

Thanks for the kind words. Of those three choices of features, I think MFCCs would be best suited for tracking changes in speech phonemes among different speakers.
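For what it's worth, here is a rough sketch of how MFCCs could feed into DTW with librosa. The function names (`mfcc_dtw_path`, `path_to_seconds`), the choice of 13 coefficients, the hop length of 512, and the per-coefficient normalization are all assumptions made for illustration, not something taken from the notebook, so treat them as a starting point to tune.

```python
def mfcc_dtw_path(y_ref, y_tgt, sr, n_mfcc=13, hop_length=512):
    """Return the DTW path as chronological (ref_frame, tgt_frame) pairs.

    Hypothetical helper: parameter defaults are assumptions, not tuned values.
    """
    import librosa  # third-party; imported lazily so path_to_seconds works without it

    def features(y):
        m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop_length)
        # Assumption: normalize each coefficient over time to reduce
        # speaker- and recording-dependent offsets before comparing frames.
        return (m - m.mean(axis=1, keepdims=True)) / (m.std(axis=1, keepdims=True) + 1e-8)

    _, wp = librosa.sequence.dtw(X=features(y_ref), Y=features(y_tgt), metric="euclidean")
    return wp[::-1]  # librosa returns the path end-to-start; flip it


def path_to_seconds(path, sr, hop_length=512):
    """Convert frame-index pairs to (ref_seconds, tgt_seconds) pairs."""
    return [(int(i) * hop_length / sr, int(j) * hop_length / sr) for i, j in path]


# Usage (file names are placeholders):
# import librosa
# y_ref, sr = librosa.load("reference.wav", sr=16000)
# y_tgt, _ = librosa.load("target.wav", sr=16000)
# path = mfcc_dtw_path(y_ref, y_tgt, sr)
# print(path_to_seconds(path, sr)[:5])
```

Both recordings need to be loaded at the same sample rate for the frame indices to be comparable.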

Your second question is a bit more involved, but in short, you use the output path from DTW to time-scale parts of the original time-domain signal. You might be able to use time_stretch from librosa: http://librosa.github.io/librosa/generated/librosa.effects.time_stretch.html#librosa.effects.time_stretch
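One possible way to do that, as a minimal sketch rather than a polished method: cut the DTW path into a handful of chunks, compute one stretch rate per chunk, and run time_stretch on the matching slice of the target audio. The helper names (`segment_rates`, `warp_target`), the default of 10 chunks, and the hop length of 512 are assumptions for illustration. The path is assumed to be in chronological order as (ref_frame, tgt_frame) pairs, computed with the same hop length.

```python
def segment_rates(path, n_segments=10):
    """Split a chronological DTW path into chunks with one stretch rate each.

    Returns (tgt_start_frame, tgt_end_frame, rate) tuples; rate > 1 means the
    target chunk is longer than the reference chunk and must be sped up.
    """
    path = [(int(i), int(j)) for i, j in path]
    step = max(1, len(path) // n_segments)
    anchors = [path[0]]
    for r, t in path[step::step] + [path[-1]]:
        # Only accept an anchor if both signals advanced, so no rate is 0 or infinite.
        if r > anchors[-1][0] and t > anchors[-1][1]:
            anchors.append((r, t))
    if len(anchors) > 1:
        anchors[-1] = path[-1]  # make the final chunk reach the end of the path
    return [(t0, t1, (t1 - t0) / (r1 - r0))
            for (r0, t0), (r1, t1) in zip(anchors, anchors[1:])]


def warp_target(y_tgt, segments, hop_length=512):
    """Stretch each target chunk by its rate and concatenate the results."""
    import numpy as np
    import librosa  # third-party; imported lazily

    pieces = []
    for t0, t1, rate in segments:
        chunk = y_tgt[t0 * hop_length : t1 * hop_length]
        pieces.append(librosa.effects.time_stretch(chunk, rate=rate))
    return np.concatenate(pieces)


# Usage (names and file path are placeholders):
# import soundfile as sf
# aligned = warp_target(y_tgt, segment_rates(path), hop_length=512)
# sf.write("aligned.wav", aligned, sr)
```

Expect audible artifacts at the chunk boundaries, and very short chunks tend to stretch poorly, so a small number of chunks is probably the safer place to start.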

@monamalhotra, I've had great success getting logical alignment with MFCC. I'm now at the scaling step. Did you ever have any luck with that?