Stage 1 training does not use audio features; does it only learn image generation?
Opened this issue · 1 comment
monkeyCv commented
Hi, authors. Thanks for your great work. I have a question about stage 1 training. It does not take audio features as input, so what is the purpose of stage 1? Consider this: we have the same reference image and the same reference image embedding, yet we have to generate two different images? Thanks.
xumingw commented
Stage 1 trains the ReferenceNet and the spatial part of the denoising UNet. Given a reference image, it should generate a random image while keeping the main features of the reference image.
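To make the split concrete, here is a minimal sketch of how stage 1 might decide which parameters to update. It is not taken from the repository: the prefixes `reference_net.` and `denoising_unet.`, the keywords `temporal` and `audio`, and the helper names are all hypothetical, chosen only to illustrate "train the ReferenceNet and the spatial layers, leave everything else frozen".

```python
# Sketch only: every parameter name and keyword below is an assumption,
# not the repository's actual naming.

# Assumed substrings that mark non-spatial (motion / audio) layers in the UNet.
NON_SPATIAL_KEYWORDS = ("temporal", "audio")


def is_trainable_in_stage1(param_name: str) -> bool:
    """Return True if this parameter would be updated in stage 1."""
    # The whole ReferenceNet is trained.
    if param_name.startswith("reference_net."):
        return True
    # Only the spatial layers of the denoising UNet are trained.
    if param_name.startswith("denoising_unet."):
        return not any(key in param_name for key in NON_SPATIAL_KEYWORDS)
    # Everything else (for example a VAE or image encoder) stays frozen.
    return False


def split_params(names):
    """Split parameter names into (trainable, frozen) lists for stage 1."""
    trainable = [n for n in names if is_trainable_in_stage1(n)]
    frozen = [n for n in names if not is_trainable_in_stage1(n)]
    return trainable, frozen


if __name__ == "__main__":
    example_names = [
        "reference_net.down_blocks.0.attn.to_q.weight",
        "denoising_unet.down_blocks.0.attn.to_q.weight",
        "denoising_unet.down_blocks.0.temporal_attn.to_q.weight",
        "denoising_unet.mid_block.audio_cross_attn.to_k.weight",
        "vae.encoder.conv_in.weight",
    ]
    trainable, frozen = split_params(example_names)
    print("trainable:", trainable)
    print("frozen:", frozen)
```

In a real training script the same predicate could be applied while iterating over a model's named parameters to set which ones receive gradients, but the exact module names would have to be checked against the actual code.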