hutaiHang/Faster-Diffusion

About Parallel encoder

sonwe1e opened this issue · 4 comments

Great work on the study, but I have some queries I'd like to ask.

If the time-steps considered as non-key directly skip the encoding step of the encoder, how are the images decoded from the features encoded by the key time-step encoder used in these non-key time-steps? Since the encoders at non-key time-steps are skipped, there wouldn't be any encoding at time t+1 either. Why not skip the non-key phases altogether?

My point is that if time t is a key moment, and t+1, t+2, t+3 are non-key, this means that the decoders for t+1, t+2, t+3 all use the features f_t from time t. According to the parallel steps in the paper, t+1, t+2, t+3 all need to decode f_t, but these time steps do not utilize the encoder. So, what is the purpose of the results obtained from this decoding?

I hope I have made my question clear. Thanks.

Even though the UNet encoder is not used during non-key timesteps, the decoder still receives the shared encoder features from the key timestep and outputs the predicted noise $\epsilon$, which is used to update $z_t$. I hope I understood your question correctly.
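
To make the point concrete, here is a minimal toy sketch of that loop. It is not the code from this repository: `toy_encoder`, `toy_decoder`, `sample`, the scalar latent, and the update rule are all hypothetical stand-ins chosen only to show the control flow. The encoder runs only at key timesteps and its output is cached; the decoder runs at every timestep, takes the cached features plus the current timestep, and its predicted noise still updates the latent.

```python
# Toy sketch only: hypothetical stand-ins, not the Faster-Diffusion implementation.

def toy_encoder(z, t):
    """Stand-in for the UNet encoder: maps latent z at step t to 'features'."""
    return 0.5 * z + 0.01 * t


def toy_decoder(features, t):
    """Stand-in for the UNet decoder: predicts noise from features and step t."""
    return 0.1 * features + 0.001 * t


def sample(z, timesteps, key_timesteps, step_size=0.1):
    """Denoise z, running the encoder only at key timesteps."""
    cached = None
    encoder_calls = 0
    decoder_calls = 0
    for t in timesteps:
        if t in key_timesteps or cached is None:
            # Key timestep: run the encoder and cache its features.
            cached = toy_encoder(z, t)
            encoder_calls += 1
        # Every timestep: the decoder reuses the cached features,
        # but still sees the current t, so eps differs per step.
        eps = toy_decoder(cached, t)
        decoder_calls += 1
        # The predicted noise updates the latent (assumed toy update rule).
        z = z - step_size * eps
    return z, encoder_calls, decoder_calls


timesteps = list(range(8, 0, -1))  # 8, 7, ..., 1
key_timesteps = {8, 4}             # encoder runs only here
z_final, n_enc, n_dec = sample(1.0, timesteps, key_timesteps)
print(n_enc, n_dec)
```

In this toy run the encoder is called twice and the decoder eight times, so the non-key steps are not wasted: each one still moves the latent, it just does so from reused encoder features.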

Thank you for your answer, it has nicely resolved my doubts. I made a silly mistake.

Thank you again for your response. I have another question. From the graph, it seems that a smaller interval in the Uniform method means fewer skipped encoders, which should bring it closer to the original diffusion process. Why, then, is the performance of I worse than that of II?
[image]