This looks extremely similar to Paella (not sure which one is the better approach)
Mut1nyJD opened this issue · 5 comments
The only difference is that this one uses masked tokens while Paella uses noised tokens
not sure which one is the better approach
we'll just have to get the code out there for people to try!
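For what it's worth, the two corruption schemes can be sketched side by side. This is a hypothetical illustration only — the `vocab_size`, `mask_id`, and uniform noising rule are assumptions, not taken from either codebase:

```python
import torch

vocab_size = 1024
mask_id = vocab_size  # assumed: one extra [MASK] id appended after the codebook


def corrupt_masked(tokens: torch.Tensor, ratio: float) -> torch.Tensor:
    """MaskGiT-style: replace a random subset of tokens with [MASK]."""
    mask = torch.rand(tokens.shape) < ratio
    return torch.where(mask, torch.full_like(tokens, mask_id), tokens)


def corrupt_noised(tokens: torch.Tensor, ratio: float) -> torch.Tensor:
    """Paella-style: replace a random subset with random codebook tokens."""
    mask = torch.rand(tokens.shape) < ratio
    noise = torch.randint(0, vocab_size, tokens.shape)
    return torch.where(mask, noise, tokens)


tokens = torch.randint(0, vocab_size, (1, 16))
print(corrupt_masked(tokens, 0.5))
print(corrupt_noised(tokens, 0.5))
```

Either way the model is trained to recover the original tokens; the difference is just whether the corrupted positions carry zero information (a mask id) or misleading information (random codes).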
@Mut1nyJD it goes way back actually, to Mask-Predict, VQ-Diffusion, then the breakout happened with MaskGit, followed by Phenaki
Paella is basically MaskGiT, but all convolutions. Not sure if I believe in that, after all that I have seen
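The shared decoding trick behind both is MaskGiT's parallel iterative sampling: predict all positions at once, keep the most confident predictions, re-mask the rest, and repeat on a shrinking schedule. A toy sketch, assuming a cosine schedule and using a random-logits stand-in for the network:

```python
import math

import torch

vocab_size, mask_id, seq_len, steps = 1024, 1024, 16, 8


def model(tokens):
    # placeholder for a real token transformer: random logits over the codebook
    return torch.randn(*tokens.shape, vocab_size)


tokens = torch.full((1, seq_len), mask_id)
for t in range(steps):
    probs = model(tokens).softmax(dim=-1)
    confidence, pred = probs.max(dim=-1)
    # already-decoded tokens stay fixed: give them infinite confidence
    confidence = torch.where(tokens == mask_id, confidence, torch.tensor(float('inf')))
    # cosine schedule: how many tokens may remain masked after this step
    num_masked = math.floor(seq_len * math.cos(math.pi / 2 * (t + 1) / steps))
    # accept all predictions, then re-mask the least confident ones
    ranked = confidence.argsort(dim=-1)  # ascending: least confident first
    tokens = torch.where(tokens == mask_id, pred, tokens)
    tokens.scatter_(1, ranked[:, :num_masked], mask_id)

print(tokens)  # fully decoded after `steps` iterations
```

At the final step the schedule reaches zero, so every position holds a real codebook token. Paella runs essentially this loop with a convolutional backbone and noised rather than masked corruption.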
True, I completely forgot about Phenaki because it was tailored to video, but in the end, yes, you are right.
Still, the big difference / novelty between this and Phenaki is not obvious to me from skimming through their project page
@Mut1nyJD the battle is far from over
i'm guessing someone will try an all-attention approach for latent diffusion next. they also did not compare to progressively distilled DDPM models, so the jury is still out on which is more efficient
@lucidrains There was a paper out in December by William Peebles building a latent diffusion model with only ViT-style attention blocks. From a cursory glance, adding residual gating and using a really high EMA update factor were essential for training stability. Unfortunately, they only published quantitative results on ImageNet, and also did not compare results with distilled DDIM models.
https://arxiv.org/pdf/2212.09748.pdf
https://www.wpeebles.com/DiT.html
https://github.com/facebookresearch/DiT
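For readers skimming the paper, the "residual gating" mentioned above is (as I read it) DiT's adaLN-Zero conditioning: the per-block gate projections are zero-initialized so each block starts as the identity map. A rough sketch — the dimensions, head count, and module layout here are illustrative assumptions, not the repo's actual code:

```python
import torch
from torch import nn


class AdaLNZeroBlock(nn.Module):
    """One DiT-style block: shift/scale/gate modulation from a conditioning vector."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        # conditioning -> 6 modulation params: shift/scale/gate for attn and mlp
        self.to_mod = nn.Linear(dim, dim * 6)
        nn.init.zeros_(self.to_mod.weight)  # zero-init: block is identity at start
        nn.init.zeros_(self.to_mod.bias)

    def forward(self, x, cond):
        s1, b1, g1, s2, b2, g2 = self.to_mod(cond).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1.unsqueeze(1)) + b1.unsqueeze(1)
        x = x + g1.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2.unsqueeze(1)) + b2.unsqueeze(1)
        return x + g2.unsqueeze(1) * self.mlp(h)


block = AdaLNZeroBlock(64)
x = torch.randn(2, 16, 64)
cond = torch.randn(2, 64)
out = block(x, cond)
print(torch.allclose(out, x))  # True: zeroed gates make the block a no-op at init
```

That identity-at-init behavior is plausibly what buys the training stability the paper reports, alongside the high EMA decay.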