Reaching EMO-level quality
varunjain99 opened this issue · 1 comment
Thanks for the great work on the project! The approaches of AniPortrait and EMO share a lot of similarities, especially in the use of SD1.5, a ReferenceNet, and AnimateDiff-style motion modules (I've put a rough sketch of how I understand these pieces fitting together below, after the questions).
Of course, we have limited visibility into EMO's outputs, but they do seem to be better. This raises the following questions:
- What are the top reasons for the quality difference between AniPortrait and EMO?
- What are the most impactful possible improvements to AniPortrait's video quality?
Which of the following do folks think is the biggest reason for the quality gap:
- Training dataset (size and quality)
- Use of intermediate representation (EMO conditions on audio directly)
- Weak guiding conditions used in EMO
- Architectural differences in the Lmk2Video stage
- Other
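As a reference point for the discussion, here is a rough, hypothetical sketch (illustrative shapes and module names only, not taken from either codebase) of how I understand the shared pieces composing inside one denoising block: features from the ReferenceNet pass are mixed into each frame's spatial self-attention, and an AnimateDiff-style motion module adds attention across frames.

```python
# Illustrative only: shapes, names, and wiring are assumptions, not AniPortrait's
# or EMO's actual code.
import torch
import torch.nn as nn

class SketchBlock(nn.Module):
    def __init__(self, dim: int = 320, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
        # x:   (batch, frames, tokens, dim) latent features for each video frame
        # ref: (batch, tokens, dim)         features from the ReferenceNet pass
        b, f, t, d = x.shape

        # Spatial attention: each frame attends over its own tokens plus the
        # reference tokens (the "ReferenceNet injection" idea).
        h = x.reshape(b * f, t, d)
        r = ref.unsqueeze(1).expand(b, f, -1, -1).reshape(b * f, t, d)
        hn, rn = self.norm1(h), self.norm1(r)
        kv = torch.cat([hn, rn], dim=1)
        h = h + self.spatial_attn(hn, kv, kv, need_weights=False)[0]

        # Temporal attention: each token position attends across frames
        # (the AnimateDiff-style motion module idea).
        h = h.reshape(b, f, t, d).permute(0, 2, 1, 3).reshape(b * t, f, d)
        hn = self.norm2(h)
        h = h + self.temporal_attn(hn, hn, hn, need_weights=False)[0]
        return h.reshape(b, t, f, d).permute(0, 2, 1, 3)

block = SketchBlock()
x = torch.randn(1, 8, 64, 320)   # batch 1, 8 frames, 64 latent tokens
ref = torch.randn(1, 64, 320)
print(block(x, ref).shape)       # torch.Size([1, 8, 64, 320])
```

The point is just that the reference identity enters through attention keys/values while motion comes from attention along the frame axis; the real implementations differ in many details.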
The authors mention the use of intermediate representations as a limitation in the paper. Certainly, it seems the Audio2Lmk model could be trained more effectively, or done away with entirely. To my eye, however, the starkest quality issue is the teeth artifacts (e.g. blurring, temporal inconsistency, generally odd-looking teeth). This leads me to believe the quality discrepancy comes either from the dataset used or from some architectural difference in Lmk2Video. Any idea what's most to blame for these types of artifacts?
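To make the intermediate-representation point concrete, here is a purely illustrative sketch (placeholder functions, not the actual AniPortrait or EMO APIs) of the two conditioning pathways. Per frame, the two-stage path squeezes the audio features down to a handful of landmark coordinates before the video model ever sees them, so anything not expressible as landmark motion (such as the state of the mouth interior) has to be hallucinated by Lmk2Video, which is one plausible source of teeth artifacts.

```python
# Placeholder stubs only: function bodies return random data so the sketch runs;
# the shapes and names are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)

def audio_to_landmarks(audio_feats: np.ndarray) -> np.ndarray:
    """Stand-in for an Audio2Lmk-style model: 68 (x, y) points per frame."""
    return rng.standard_normal((audio_feats.shape[0], 68, 2))   # 136 numbers/frame

def landmarks_to_video(lmk: np.ndarray, ref_img: np.ndarray) -> np.ndarray:
    """Stand-in for a Lmk2Video-style diffusion stage."""
    return rng.standard_normal((lmk.shape[0], *ref_img.shape))

def audio_conditioned_video(audio_feats: np.ndarray, ref_img: np.ndarray) -> np.ndarray:
    """Stand-in for an EMO-style stage conditioned on audio embeddings directly."""
    return rng.standard_normal((audio_feats.shape[0], *ref_img.shape))

audio = rng.standard_normal((40, 768))   # e.g. 40 frames of wav2vec-like features
ref = rng.standard_normal((512, 512, 3))

two_stage = landmarks_to_video(audio_to_landmarks(audio), ref)  # audio -> lmk -> video
direct = audio_conditioned_video(audio, ref)                    # audio -> video
print(two_stage.shape, direct.shape)
```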
Curious what the community and authors think - especially about what causes the lower image/video quality compared to EMO!
I am also curious about that. I think the problem lies in the scripts/libraries/programs used to restore and animate the face realistically. I think the models used to recreate it lack a base of realistic face/body movements. We should add something that is trained on realistic face/body movements and then apply it to the character during the process.
I found this program, also contributed by the Alibaba team:
https://github.com/ali-vilab/dreamtalk
and it is based on this program:
https://github.com/RenYurui/PIRender
and it uses this program:
https://github.com/sicxu/Deep3DFaceRecon_pytorch/issues
and this script was used by PIRender for audio-driven movements:
https://github.com/simonalexanderson/StyleGestures
So it is audio-driven: it generates somewhat random body movements and also produces "realistic" face movement. I think that's what we need. As far as I noticed, it also generates anatomical movement of the neck (I think), which is something we can observe in EMO. It is a little simpler than EMO, but I think it is the base(?)
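For clarity, here is a hypothetical sketch of the chain described above (placeholder functions only; this is not DreamTalk's, PIRender's, or StyleGestures' actual API): audio drives 3DMM expression coefficients plus head/neck pose, and a PIRender-style warping renderer re-poses the reference portrait from those coefficients.

```python
# Placeholder stubs so the sketch runs; shapes and names are assumptions.
import numpy as np

rng = np.random.default_rng(0)

def audio_to_expression_coeffs(audio_feats: np.ndarray) -> np.ndarray:
    """Stand-in for a DreamTalk-style audio-to-expression model (3DMM coefficients)."""
    return rng.standard_normal((audio_feats.shape[0], 64))   # e.g. 64 expression dims

def audio_to_pose(audio_feats: np.ndarray) -> np.ndarray:
    """Stand-in for a StyleGestures-style audio-driven head/neck pose generator."""
    return rng.standard_normal((audio_feats.shape[0], 6))    # rotation + translation

def render_frames(ref_img: np.ndarray, expr: np.ndarray, pose: np.ndarray) -> np.ndarray:
    """Stand-in for a PIRender-style warping renderer driven by the coefficients."""
    return rng.standard_normal((expr.shape[0], *ref_img.shape))

audio = rng.standard_normal((40, 768))
ref = rng.standard_normal((256, 256, 3))

frames = render_frames(ref, audio_to_expression_coeffs(audio), audio_to_pose(audio))
print(frames.shape)   # (40, 256, 256, 3): one re-posed frame per audio frame
```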
I may be wrong, but those programs/technologies seem to be connected to EMO. Also, for some reason DreamTalk scaled back public access, so that may be it.