DanielSWolf/rhubarb-lip-sync

Proof of concept for Rhubarb 2

DanielSWolf opened this issue · 13 comments

Rhubarb Lip Sync 2 will be a full rewrite based on my learnings from version 1 (see #95). As a first step, I'll create a proof of concept (PoC) to test the new libraries and approaches I'm planning to use. As a use case, I've chosen the Italian fan dub project of Thimbleweed Park. Once my PoC is good enough to give them Italian lip sync that's better than the current results using v1, I'll start work on v2 in earnest.

Here are the steps for the PoC (for details see below):

  • Create an Italian pronunciation dictionary
  • Build a G2P model using Phonetisaurus
  • Assemble an Italian speech corpus
  • Train the Montreal Forced Aligner on the Italian speech corpus
  • Hack Rhubarb 1 to accept pre-aligned phonemes
  • Write scripts to create a rough tool chain

  1. First of all, I need an Italian pronunciation dictionary -- that is, a file that contains pronunciation information for several thousand Italian words. This proved to be the first hurdle: There just isn't any existing (free) Italian pronunciation dictionary with the required size. So I'll roll my own based on data from Wiktionary.
  2. Once I have a basic pronunciation dictionary, I can use an existing machine learning tool (Phonetisaurus) to guess the pronunciation of any unknown word based on the pronunciation dictionary. This should be pretty painless, and it will allow me to calculate the pronunciation of any Italian sentence based on its dialog file (ignoring things like numbers or abbreviations; but there don't seem to be many of these in Thimbleweed Park).
  3. Next, I'll need a large Italian speech corpus, that is, a collection of recorded speech with transcripts. There are many freely available speech corpora, but none of them have the required amount (about 1,000 hours) of Italian speech. So I'll have to mix and match.
  4. Next, I'll train an existing forced alignment tool (Montreal Forced Aligner) on a large corpus of Italian recordings, allowing me to align arbitrary recordings later on.
  5. The last missing ingredient is a way to turn aligned phonemes into animation. To do this, I'll hack the existing Rhubarb v1 engine to accept pre-aligned phonemes as input, bypassing the normal recognition phase.
  6. At this point, I'll be able to take an Italian dialog file, calculate its pronunciation, align the pronunciation with the recording, and thus get the exact timing information of which phoneme is said when. I can then plug this information into the hacked Rhubarb 1 engine and hopefully get perfect Italian lip sync. Putting it all together will require some scripting.
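
To make the dictionary-plus-G2P idea from steps 1 and 2 concrete, here is a minimal Python sketch of how a dialog line might be turned into phonemes: look each word up in the pronunciation dictionary and fall back to a G2P function for unknown words. All names here (`pronounce_line`, `toy_lexicon`, `spell_out`) are hypothetical and only for illustration. `spell_out` is a stand-in for a trained G2P model and does not call Phonetisaurus.

```python
import re
from typing import Callable, Dict, List

Phonemes = List[str]

# Letters only (no digits or underscores), optionally joined by an apostrophe,
# so that forms like "dell’hotel" stay a single token.
WORD_RE = re.compile(r"[^\W\d_]+(?:[’'][^\W\d_]+)*")


def pronounce_line(
    line: str,
    lexicon: Dict[str, Phonemes],
    g2p: Callable[[str], Phonemes],
) -> List[Phonemes]:
    """Return one phoneme list per word of the dialog line."""
    result = []
    for word in WORD_RE.findall(line.lower()):
        phones = lexicon.get(word)
        if phones is None:
            # Unknown word: ask the G2P fallback to guess a pronunciation.
            phones = g2p(word)
        result.append(phones)
    return result


def spell_out(word: str) -> Phonemes:
    """Placeholder G2P (assumption): one 'phoneme' per letter."""
    return list(word)


# Toy dictionary with two entries taken from the examples in this thread.
toy_lexicon = {
    "sembra": ["s", "e", "m", "b", "r", "a"],
    "quel": ["k", "w", "e", "l"],
}

line_phonemes = pronounce_line("Sembra quel park!", toy_lexicon, spell_out)
```

Numbers and abbreviations are deliberately ignored here, matching the simplification described in step 2.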

Step 1, the Italian pronunciation dictionary, is done. I recently released WikiPronunciationDict, a multilingual pronunciation dictionary that currently contains pronunciations for about 90,000 Italian words.

Step 2, the G2P model, is done. I've trained Phonetisaurus on the Italian WikiPronunciationDict, and the resulting model does a really good job at guessing the pronunciation of Italian words. Here are some pronunciations guessed by my model.

Note: Pronunciations for non-Italian words are off, as was to be expected.

bip	b i p
chuck	ʃ a k
chuck	t ʃ a k
sembra	s e m b r a
sembra	s ɛ m b r a
ransome	r a n s o m e
ransome	r a n z ɔ m e
ransome	r a n s ɔ m e
delores	d e l o r e s
delores	d e l ɔ r
sul	s u l
bù	b u
quel	k w e l
quel	k e l
quel	k w ɛ l
thimbleweed	t i m b e w i d
chei	k ɛ i
chei	k e i
park	p a r k
willie	w i lː j e
willie	w i lː i a
willie	w i lː i e
thimblecon	t i m b l ɛ k o n
be’	b ɛ
be’	b e
dell’hotel	d e lː o t ɛ l
aver	a v e r
aver	a v ɛ r
edmund	e d m u n d
edmund	ɛ d m u n d
rino	r i n o
ray	r a j
ray	r ɛ i
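
A listing like the one above can be read back into a word-to-variants mapping with a few lines of Python. This is only a sketch: it assumes one `word<TAB>phonemes` pair per line with space-separated phonemes, as shown here, and `parse_g2p_output` is a hypothetical helper, not part of Phonetisaurus.

```python
from collections import defaultdict
from typing import Dict, List


def parse_g2p_output(text: str) -> Dict[str, List[List[str]]]:
    """Group pronunciation variants by word, keeping their original order."""
    variants = defaultdict(list)
    for raw in text.splitlines():
        if not raw.strip():
            continue  # skip blank lines
        word, _, phones = raw.partition("\t")
        variants[word.strip()].append(phones.split())
    return dict(variants)


# Three lines copied from the listing above.
sample = "chuck\tʃ a k\nchuck\tt ʃ a k\nsul\ts u l\n"
parsed = parse_g2p_output(sample)
```

Words with several guesses (such as "chuck") end up with one phoneme list per variant.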

Regarding step 3, I've decided to use the Italian MLS corpus. It has 247 hours of Italian speech including transcriptions.

I considered adding in the Italian Common Voice corpus, but decided against it. The total amount of validated Italian recordings there is 158 hours, but I'd like to maintain a 1:1 ratio between male and female recordings. The problem is that only 68% of Italian Common Voice recordings contain speaker information, and out of these, only 21% are spoken by women. This means that after filtering, I'd be left with only about 44 hours of Common Voice speech. It just doesn't seem worth the effort, given that the 247 hours from MLS should already be sufficient (and have a perfect 1:1 ratio). What's more, all other languages I'm currently interested in (English, French, German) have many more recordings in the MLS corpus. So if MLS proves to be enough for Italian, it should be enough for any language.
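
The estimate above can be reproduced with a short calculation. This sketch assumes that "1:1 ratio" means keeping all female recordings plus an equal number of hours of male recordings. With the rounded percentages quoted above it gives roughly 45 hours, close to the "about 44 hours" figure.

```python
total_hours = 158.0        # validated Italian Common Voice recordings
with_speaker_info = 0.68   # share of recordings that carry speaker information
female_share = 0.21        # share of those recordings spoken by women

labeled_hours = total_hours * with_speaker_info   # about 107.4 hours
female_hours = labeled_hours * female_share       # about 22.6 hours
male_hours = labeled_hours - female_hours

# A 1:1 split is limited by the smaller group, so keep that amount twice.
balanced_hours = 2 * min(female_hours, male_hours)
print(f"{balanced_hours:.1f} hours usable at a 1:1 ratio")
```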

Hey @DanielSWolf thank you for your effort to develop Rhubarb2! Did you find out if MLS corpus provides enough data?

@giuliogatto Not yet. I've decided to set up this test step in a manner that I can re-use after the proof of concept. This involves a virtual server, Docker, and quite a bit of scripting. I'm working on it, but I don't have results yet.

Thank you for the update!

I'm one step further! I've successfully trained the Montreal Forced Aligner with the Italian MLS corpus. It turned out that this corpus is more than large enough for my needs! 😄

I've attached a screenshot from Praat showing the resulting alignment for one of Delores' lines from Thimbleweed Park. The speech alignment down to the individual phoneme is excellent, at least as good as the English alignment in the current version of Rhubarb.

image

Hey, @DanielSWolf thank you for the update! The alignment looks excellent, keep up the good work!

Hey Daniel! This is a really cool piece of software and I've been searching for something like this for quite a while.

I'm a French Canadian animator and I often do lip sync work in French. V1 has some issues here and there when using the phonetic recognizer, but this new multi-language version looks like a great improvement.

I'd love to contribute if I can. Let me know if I can help with testing or anything else. I could also help out with After Effects scripting if needed.

Cheers and thanks for your work again!

Hi @seblavoie, that's nice of you to offer! I've still got a lot of work before me to get the PoC running and to develop V2. Once that's done, adding more languages will be much easier. But it will still require good knowledge about the individual languages (things like the pronunciation of numbers, dates, abbreviations etc.). At that point, I'll be grateful for support from you as a native speaker!

Thanks for the quick reply! Alright then, let me know if there's anything I can do to help 🤙

It's done! I've hacked Rhubarb 1.12 to accept the output of the Montreal Forced Aligner as input. This allows me to perform Italian lip sync for the Italian fan dub project of Thimbleweed Park.
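
For readers wondering what "accepting pre-aligned phonemes" means in practice, here is a heavily simplified Python sketch of the general idea: each aligned phoneme interval is mapped to a mouth shape, and consecutive identical shapes are merged. It is not the actual hack. The phoneme-to-shape table, the timings, and the name `to_mouth_cues` are assumptions made up for illustration, and Rhubarb's real animation rules are considerably more elaborate.

```python
from typing import List, Tuple

# Hypothetical mapping from IPA phonemes to mouth-shape letters
# (assumption for illustration; not Rhubarb's actual rules).
SHAPE_FOR_PHONEME = {
    "p": "A", "b": "A", "m": "A",   # closed lips
    "a": "D",                       # wide open
    "e": "C", "ɛ": "C",             # open
    "i": "B",                       # slightly open
    "o": "E", "ɔ": "E",             # slightly rounded
    "u": "F", "w": "F",             # puckered
    "f": "G", "v": "G",             # teeth on lower lip
    "l": "H",                       # tongue raised
}
DEFAULT_SHAPE = "B"  # fallback for phonemes not listed above
REST_SHAPE = "X"     # used for silence (empty label)

Interval = Tuple[float, float, str]  # (start seconds, end seconds, phoneme)


def to_mouth_cues(intervals: List[Interval]) -> List[Tuple[float, str]]:
    """Return (start time, shape) cues, merging repeated shapes."""
    cues: List[Tuple[float, str]] = []
    for start, _end, phoneme in intervals:
        if phoneme == "":
            shape = REST_SHAPE
        else:
            # Drop the length mark so that "lː" is treated like "l".
            shape = SHAPE_FOR_PHONEME.get(phoneme.rstrip("ː"), DEFAULT_SHAPE)
        if not cues or cues[-1][1] != shape:
            cues.append((start, shape))
    return cues


# Made-up alignment for the word "sul" with silence before and after.
aligned = [
    (0.00, 0.12, ""),
    (0.12, 0.20, "s"),
    (0.20, 0.31, "u"),
    (0.31, 0.40, "l"),
    (0.40, 0.60, ""),
]
cues = to_mouth_cues(aligned)
```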

I'm very happy with the results: Not only do they look far better than what was previously possible with the built-in phonetic recognizer; in places, they even surpass the animation quality you currently get for English recordings. 😃

Here's a short demo video.

PoC.mp4

Now that the proof of concept is done, I'll soon start with the actual coding.

A bit of warning, though: It's a long way from a proof of concept to a production-ready product. Don't expect Rhubarb 2.0 for at least another year!

That looks very promising! Keep up the good work!