benfmiller/audalign

[Request/Suggestion] Support unpredictable frame drops and unmatching speed/pitch (drift correction)

Opened this issue · 1 comments

I'm looking for a way to perform (potentially destructive) audio track synchronization between old versions of movies (dubbed in a different language) and their remastered versions.

In my scenario, applying a single audio shift is not enough: sooner or later the tracks drift out of sync, at least due to:

  • unpredictable frame drops in both tracks
  • mismatched overall average speed (often with higher pitch for the faster audio)
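
To put a number on the speed/pitch relationship: if the speed difference comes from plain resampling (a playback-rate change with no pitch correction), a speed ratio r shifts pitch by 12·log2(r) semitones. A minimal sketch; the function name is made up for illustration, and the 25 / 23.976 ratio is only an assumed example of a frame-rate conversion, not something measured from these tracks:

```python
import math


def pitch_shift_semitones(speed_ratio: float) -> float:
    """Pitch shift, in semitones, caused by playing audio at `speed_ratio`
    times its original speed without pitch correction."""
    if speed_ratio <= 0:
        raise ValueError("speed_ratio must be positive")
    return 12.0 * math.log2(speed_ratio)


# Assumed example: film material at 23.976 fps played back at 25 fps.
example_ratio = 25 / 23.976
example_shift = pitch_shift_semitones(example_ratio)  # roughly 0.72 semitones
```

If the pitch of the two tracks matches while the speed does not, the faster track was probably time-stretched rather than resampled, and this relationship does not apply.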

Any interest in supporting such a scenario?

Are there any existing projects that try to solve this problem?

Any ideas on the best way to implement it?


Naive idea for an implementation:

  • do an initial synchronization
  • until the old dubbed audio ends:
    • detect whether the segment likely contains voice (with something like silero-vad) or something non-silent and unvoiced (ideally, a music segment)
    • somehow measure the tempo difference between the old and new audio segments
      • if it's voice: recognize it (with something like whisper.cpp) and compare the time between the first and last word of the segment in the old audio with the same span in the new audio
      • if it's something else: probably just compare the time between the two loudest points of the old segment with the same span in the new segment
    • shrink/stretch (speed up/slow down) the old audio (dubbed in the other language): the analyzed segment, voiced or not, and any of the next N segments
    • repeat
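
The "measure tempo difference" step above can be sketched as follows, assuming two anchor events (first/last word, or the two loudest points) have already been located in both tracks. `Anchor`, `stretch_factor` and `map_time` are hypothetical names used only for illustration; finding the anchors (VAD, speech recognition, peak picking) is not shown:

```python
from dataclasses import dataclass


@dataclass
class Anchor:
    """Timestamps, in seconds, of the same event in the old and new track."""
    old: float
    new: float


def stretch_factor(first: Anchor, last: Anchor) -> float:
    """Factor by which the old segment's duration must be multiplied
    so that it spans the same time as the new segment."""
    old_span = last.old - first.old
    new_span = last.new - first.new
    if old_span <= 0 or new_span <= 0:
        raise ValueError("anchors must be in increasing time order")
    return new_span / old_span


def map_time(t_old: float, first: Anchor, factor: float) -> float:
    """Map a timestamp in the old segment onto the new track's timeline."""
    return first.new + (t_old - first.old) * factor


# Example: the old segment lasts 60.0 s, the matching new one 57.6 s.
first = Anchor(old=10.0, new=10.5)
last = Anchor(old=70.0, new=68.1)
factor = stretch_factor(first, last)     # about 0.96: old audio must be shortened
midpoint = map_time(40.0, first, factor)  # about 39.3 s on the new timeline
```

A factor below 1 means the old segment has to be sped up. Tools that take a tempo multiplier (for example ffmpeg's `atempo` filter) would need the inverse, 1 / factor. Whether to correct pitch at the same time depends on how the speed difference arose.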

Thanks!

Sorry for the late reply, and thanks for the suggestion!

Audalign currently has a "locality" feature, which breaks audio files into segments and aligns them based on the strength of the match between segments (more info in the wiki). This could be used relatively easily to stretch the audio files, but it wouldn't handle frame drops.
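
As a rough sketch of how per-segment results could drive stretching: assuming each segment yields a (time, offset) pair, a least-squares line through those pairs gives the drift rate (slope) and the initial shift (intercept). The input lists and function names below are hypothetical and do not reflect audalign's actual output format:

```python
def fit_drift(times, offsets):
    """Least-squares fit of offset = intercept + slope * time.

    `times` are segment positions in seconds, `offsets` the measured
    shift (seconds) between the two tracks at each position.
    Returns (slope, intercept)."""
    n = len(times)
    if n < 2 or n != len(offsets):
        raise ValueError("need at least two (time, offset) pairs")
    mean_t = sum(times) / n
    mean_o = sum(offsets) / n
    var_t = sum((t - mean_t) ** 2 for t in times)
    if var_t == 0:
        raise ValueError("all segment times are identical")
    cov = sum((t - mean_t) * (o - mean_o) for t, o in zip(times, offsets))
    slope = cov / var_t
    return slope, mean_o - slope * mean_t


def max_residual(times, offsets, slope, intercept):
    """Largest absolute deviation of a measured offset from the fitted line."""
    return max(abs(o - (intercept + slope * t)) for t, o in zip(times, offsets))


# Example: the offset grows by 1 ms per second of audio, starting at 0.5 s.
times = [0.0, 60.0, 120.0, 180.0]
offsets = [0.50, 0.56, 0.62, 0.68]
slope, intercept = fit_drift(times, offsets)
worst = max_residual(times, offsets, slope, intercept)
```

A single line only models constant drift. A large residual may hint at a discontinuity such as a frame drop, which this fit cannot represent and which would need separate handling.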

It looks like AudioAlign's graph/feature is purely based on correlation? I don't have much time to work on this in the near future, but if it's an easy change I'd be happy to work on it. Or, I'd gladly accept pull requests!

silero-vad and whisper look like a neat idea for a new recognizer! For this case, would translated audio segments necessarily line up with word starts and ends? Would translated segments be viable as time markers, or would the shrinking/stretching have to be done based on the background?