
Compare DiS, Diffusion-RWKV, and Dimba?

Opened this issue · 4 comments

Could you tell me which performs best among DiS, Diffusion-RWKV, and Dimba? Is there any literature that compares them and explains the differences?

feizc commented

Hi, DiS, Diffusion-RWKV, and DiT are tag-based (class-conditional) image generation models, while Dimba, PixArt, and SDXL are text-to-image models. As for structure analysis, the Mamba2 paper also shows that a mixture of both (Mamba and attention layers) gives more interesting results.


Thank you very much for your reply. I think that whether a generation model is conditioned on tags or on text, we can treat both as condition-based generation models. The question is then how to inject the condition into the model through a structure similar to cross-attention.
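For illustration, here is a minimal NumPy sketch of conditioning through cross-attention. It is only an assumption about what such a structure could look like, not how DiS, Diffusion-RWKV, or Dimba actually implement it. All names (`cross_attention`, `w_q`, `w_k`, `w_v`) and shapes are hypothetical. The point is that a class tag (one embedding) and a text prompt (several embeddings) can go through the same interface:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(tokens, cond, w_q, w_k, w_v):
    """Single-head cross-attention (hypothetical helper).

    tokens: (N, d)   image/latent tokens, used as queries
    cond:   (M, d_c) condition embeddings, used as keys and values
    """
    q = tokens @ w_q                          # (N, d)
    k = cond @ w_k                            # (M, d)
    v = cond @ w_v                            # (M, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (N, M)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights               # (N, d), (N, M)

rng = np.random.default_rng(0)
N, d, d_c = 16, 32, 24                        # assumed sizes
tokens = rng.normal(size=(N, d))
w_q = rng.normal(size=(d, d)) / np.sqrt(d)
w_k = rng.normal(size=(d_c, d)) / np.sqrt(d_c)
w_v = rng.normal(size=(d_c, d)) / np.sqrt(d_c)

text_cond = rng.normal(size=(4, d_c))         # e.g. 4 text-token embeddings
class_cond = rng.normal(size=(1, d_c))        # a class tag as one embedding

out_text, w_text = cross_attention(tokens, text_cond, w_q, w_k, w_v)
out_class, w_class = cross_attention(tokens, class_cond, w_q, w_k, w_v)

# Residual update of the image tokens with the conditioned signal.
tokens_text = tokens + out_text
tokens_class = tokens + out_class
```

With a single condition embedding, every attention weight is 1, so every token receives the same value vector. In that case cross-attention reduces to adding one projected class embedding to all tokens.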

I would also like to discuss another question with you. Do you think it is promising to apply models like DiT, DiS, Diffusion-RWKV, and Dimba to the generation of other types of data, such as point clouds, human motion, etc.?

Of course, I know that applying diffusion models to such modalities has already become very popular. I am interested in whether you expect any benefits from using Mamba, RWKV, or Mamba+Transformer backbones in diffusion models for these modalities.

feizc commented

To be honest, I largely agree with the data-centric viewpoint, i.e., data quality is all you need. When the data is held the same, I expect hybrid structures (Mamba+Transformer) to have more advantages 😊.

Thank you so much for taking the time to respond to my question!