wangyuchi369/InstructAvatar

Concerns about InstructAvatar


Some of the claims and methods in this paper prompted me to write this comment and lay out a few of my doubts.

1. Is this the first text-guided 2D-based talking face generation framework?
There have been several text-guided talking face generation methods before this work, such as EAT, Style2Talker, TalkCLIP, and AgentAvatar. However, the introduction does not mention these works and claims to be the first to tackle this task with a 2D-based approach, which seems inappropriate. Moreover, the method in this paper is quite similar to AgentAvatar, TalkCLIP, and Style2Talker, all of which generate text descriptions from AUs; although this paper introduces GPT-4V, it is only used to correct AU errors. The paper neither discusses nor compares itself with these existing methods.
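To make the annotation pattern I am referring to concrete, the pipeline shared by these methods is roughly the following. This is only an illustrative sketch: the detector and GPT-4V correction calls are stubbed-out placeholders, not code from this paper or any of the works named above.

```python
# Illustrative sketch of the common AU-to-text annotation pattern.
# detect_aus() and correct_with_gpt4v() are stand-in stubs, not real code
# from any of the cited papers.

AU_TO_PHRASE = {1: "raise your inner brows", 4: "lower your brows",
                6: "raise your cheeks", 12: "pull your lip corners"}

def detect_aus(frames):
    """Stub for an off-the-shelf AU detector (OpenFace-style output)."""
    return {6, 12}  # pretend the detector fires on a smiling clip

def correct_with_gpt4v(frames, text):
    """Stub for the VLM pass that, per the paper, only fixes AU detection errors."""
    return text

def annotate_clip(frames):
    aus = detect_aus(frames)
    text = " and ".join(AU_TO_PHRASE[au] for au in sorted(aus))
    return correct_with_gpt4v(frames, text).capitalize() + "."

print(annotate_clip(frames=None))  # -> "Raise your cheeks and pull your lip corners."
```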

2. The AU action control demonstrated in this paper is consistent with the expressions in the CC dataset. Can it generate other AU actions?
The AU-controlled expressions shown in the paper (e.g., "Raise your cheek and pull your lip corners") are identical to expressions already present in the CC dataset (e.g., smile). This raises doubts about whether InstructAvatar can generate AU actions that differ from those in the CC dataset, such as purely raising an eyebrow or purely frowning; generating action combinations already present in the dataset is comparatively easy.
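For readers unfamiliar with FACS, the instruction quoted above maps directly onto the standard smile pair AU6 + AU12, which is why it coincides with an in-dataset expression; the single-AU controls I am asking about would have to be produced in isolation. A minimal illustration, using only standard FACS names (nothing model-specific):

```python
# Standard FACS action-unit names. "Raise your cheek and pull your lip corners"
# is exactly AU6 + AU12, the pair that constitutes a smile, whereas the single-AU
# requests above rarely occur alone in expression-level talking-head data.
FACS_NAMES = {1: "Inner Brow Raiser", 2: "Outer Brow Raiser",
              4: "Brow Lowerer", 6: "Cheek Raiser", 12: "Lip Corner Puller"}

smile_pair = {6, 12}             # the instruction quoted above
single_au_requests = [{2}, {4}]  # pure eyebrow raise, pure frown

print([FACS_NAMES[au] for au in sorted(smile_pair)])
print([[FACS_NAMES[au] for au in req] for req in single_au_requests])
```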

3. Many of the audio samples in this paper are from the MEAD dataset. Can InstructAvatar achieve good results with in-the-wild long audio clips?

Finally, I would like to thank the authors for their efforts in the field of talking head generation.


Hello! Thank you for your attention and your questions. We appreciate that you registered on GitHub specifically for this discussion; it shows your enthusiasm and curiosity for research. We take your questions seriously, and some of your suggestions are insightful and prompted us to revisit and re-examine the paper.

Regarding your first point, thank you for your interest in the talking head field. First, we would like to emphasize the qualifier "2D-based" in our original statement, "To our best knowledge, it is the first text-guided 2D-based talking face generation framework." In the Related Works section we mention pioneering efforts such as TalkCLIP, ExpCLIP, and Media2Face, but these works are 3D-based and aim to generate 3D animations; to obtain a video, they rely on an off-the-shelf renderer, whereas our method emits video directly, so the two paradigms are not the same. As for the other models: EAT does use CLIP to encode text, but the text does not participate in training the main model; it is only used, in the form of a CLIP loss, for the extended zero-shot editing task, and we did include EAT as one of our baselines. According to its arXiv timeline, Style2Talker is still a concurrent work even viewed from today. We did overlook AgentAvatar, but it is also a 3D-based method that renders with PD-FGC, and it focuses more on building an agent (i.e., providing an interactive environment rather than directly guiding the avatar's motion with text). We also tried to compare against more models, but some, such as TalkCLIP and ExpCLIP, are not open-sourced, and we received no reply after contacting the authors, making it difficult to present their results. We have tried our best to compare against baselines covering as many types of emotion guidance as possible. Nonetheless, we appreciate you pointing out these related works. These new developments in the community are exciting; in a possible future version we will cite them or rewrite the Introduction, and we look forward to experimental comparisons once they are open-sourced, to make our paper more solid.

Regarding your second point, about the range of expressions that AUs can control: some of your ideas are reasonable. In fact, we mention this in the Limitation part of the Conclusion: "our model is trained solely on a combination of action units extracted from real talking videos. This dependency between action units may limit its ability to precisely control a disentangled single action unit." On the one hand, our model is certainly not limited to the expressions in the dataset; for example, the first three videos in our demo use hand-written instructions for which it is hard to find identical descriptions in the dataset. (Intuitively, different expressions share some AUs and differ in others, so by contrasting them the model can learn a representation that is disentangled to some degree.) On the other hand, as we admit in the limitations, training on real talking head datasets reduces the diversity of expressions, and the lack of an explicit disentanglement mechanism makes precise control of a single AU difficult; your concern here is valid, and this is one of the directions we plan to improve next.
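As a toy illustration of the partial-overlap argument above (the AU combinations below are made-up examples, not our actual training annotations):

```python
# Toy sketch: different training expressions share some AUs and differ in
# others, so a requested combination can be partially covered even if it
# never appears verbatim. These AU sets are illustrative, not our real data.

def jaccard(a, b):
    """Overlap between two AU sets (1.0 = identical, 0.0 = disjoint)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

train_combos = [{6, 12}, {1, 2, 5, 26}, {4, 7, 9}]  # smile-, surprise-, disgust-like
requested = {2}                                     # a pure outer-brow raise

best = max(jaccard(requested, combo) for combo in train_combos)
print(best)  # 0.25 -- partially covered by the surprise-like combo, never seen alone
```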

Regarding your third point, we also mention in the Limitations that "the relatively modest size of our training dataset may hinder its robustness when faced with highly out-of-domain instructions or appearances." We are not overly concerned about OOD audio, because our model is fine-tuned from Microsoft's GAIA model, which was trained on a large amount of audio and has demonstrated good robustness to OOD audio; in our demo videos we deliberately tested relatively OOD audio such as singing and TTS-generated speech, and the model performed reasonably well. However, we acknowledge that the limited scale of the MEAD dataset means that control over OOD appearances and instructions may not always be satisfactory.

Overall, we greatly appreciate your suggestions, particularly your second and third concerns, which align with the limitations we mentioned and are directions for our future improvements. We also note that these two points are, to some extent, data limitations; as data and models scale up, we look forward to the talking head community producing even more impressive work!

If you would like to discuss further, feel free to email me at wangyuchi369@gmail.com.

Thank you very much for the explanation, but it does not fully resolve my doubts.
You use "2D-based" to justify the word "first", which may mislead reviewers unfamiliar with this field into thinking this is the first text-controlled work.
Moreover, your text annotation is essentially the same as that of several earlier methods, yet you never explicitly compare annotation schemes and only mention them quietly in the related work, making it look as if this were your novelty.
The action control you demonstrate consists entirely of in-dataset actions already present in the CC dataset, yet you claim precise motion control, which again may mislead reviewers unfamiliar with the field.

Such practices are hard to respect. Moreover, the author list includes several MSRA researchers. I wonder how the reviewers who were assigned this paper reacted.