kxhit/EscherNet

Training details

fradif96 opened this issue · 3 comments

Hello! Congratulations on the great work.
I have a question about the training process. In Section 3.1 you say, "It builds upon an existing 2D diffusion model, inheriting its strong web-scale prior through large-scale training". However, the rest of the paper leaves it unclear whether the overall architecture is trained from scratch on the Objaverse dataset (rendered as in Zero123), or fine-tuned starting from pre-trained Stable Diffusion modules. Could you please clarify?
Thanks in advance

Hi @fradif96, the network is fine-tuned from the Stable Diffusion v1.5 checkpoint.
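For reference, initializing from that checkpoint with the Hugging Face `diffusers` library typically looks like the sketch below (assuming the public `runwayml/stable-diffusion-v1-5` weights; this is an illustration, not the repository's actual training script):

```python
from diffusers import AutoencoderKL, UNet2DConditionModel

# Hypothetical initialization from the public SD v1.5 weights; the actual
# EscherNet training code may load and adapt these modules differently.
sd_repo = "runwayml/stable-diffusion-v1-5"
unet = UNet2DConditionModel.from_pretrained(sd_repo, subfolder="unet")
vae = AutoencoderKL.from_pretrained(sd_repo, subfolder="vae")
```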

Thanks! How do you handle training of the modified cross-attention and self-attention layers? Do you substitute the modules with completely new ones, or do you reuse the pre-trained ones even though the data being processed is different (e.g., text embeddings vs. image encodings)? And then fine-tune the overall architecture?

kxhit commented

Hi @fradif96
We don't add any new modules for cross/self-attention. They are the same attention layers; we just reshape the latent features from `((b t) l d)` to `(b (t l) d)` here. The trainable layers are the UNet and the ConvNeXt image encoder here (we didn't use CLIP, as it only captures high-level semantics); the VAE is frozen, and we fine-tuned from the SD1.5 initialization here.
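For concreteness, here is a minimal sketch of the reshape described above, using `einops`; the sizes `b`, `t`, `l`, `d` and the `nn.MultiheadAttention` stand-in are illustrative, not the repository's actual code:

```python
import torch
import torch.nn as nn
from einops import rearrange

b, t, l, d = 2, 4, 64, 320    # hypothetical: batch, views, tokens per view, channels
x = torch.randn(b * t, l, d)  # per-view latent tokens, laid out as ((b t), l, d)

# Fold the view axis into the token axis so the *unchanged* pre-trained
# self-attention layer attends across all t views jointly.
x = rearrange(x, '(b t) l d -> b (t l) d', t=t)

# Stand-in for the pre-trained SD self-attention block.
attn = nn.MultiheadAttention(embed_dim=d, num_heads=8, batch_first=True)
x, _ = attn(x, x, x)

# Restore the per-view layout for the rest of the UNet.
x = rearrange(x, 'b (t l) d -> (b t) l d', t=t)
print(x.shape)  # torch.Size([8, 64, 320])
```

The freeze/fine-tune split then amounts to calling `requires_grad_(False)` on the VAE and passing only the UNet and image-encoder parameters to the optimizer.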

Hope it helps, and please let me know if anything is still unclear.