Stage1 training does not use audio features; does it only learn image generation?

Question

Opened this issue 5 months ago · 1 comment

Hi, authors. Thanks for your great work. I have a question about stage1 training. It does not take an audio feature as input, so what is the purpose of stage1? As I understand it, we have the same ref image and the same ref image embedding, yet we have to generate two different images? Thanks.

Answer 1 · 2024-07-16T06:54:12.000Z

Stage1 trains the ReferenceNet and the spatial part of the denoising UNet. Given a ref image, it should generate a random image that keeps the main features of the ref image.
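
To make the split concrete, here is a rough sketch of how one might pick which parameters get updated in stage1. It is not the repo's actual code: the parameter names and the `temporal` / `audio` markers are made up for illustration, so check the real module names before relying on anything like this.

```python
# Assumed name markers for modules that are NOT trained in stage1.
# These strings are hypothetical; a real checkpoint may name things differently.
FROZEN_IN_STAGE1 = ("temporal", "audio")


def stage1_trainable(param_names):
    """Return the parameter names that would be updated in stage1,
    under the assumed naming scheme above."""
    return [
        name
        for name in param_names
        if not any(marker in name for marker in FROZEN_IN_STAGE1)
    ]


# Hypothetical parameter names, for illustration only.
names = [
    "reference_net.down.0.weight",
    "denoising_unet.down.0.spatial_attn.weight",
    "denoising_unet.down.0.temporal_attn.weight",
    "denoising_unet.down.0.audio_attn.weight",
]

print(stage1_trainable(names))
```

With these made-up names, only the ReferenceNet weight and the spatial attention weight of the denoising UNet are selected; the temporal and audio entries are left for a later stage.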