
About the implementation of the multi-scale condition

XiaoqiangZhou opened this issue · 1 comments

Thanks for sharing this great work.

In the paper, you mention that the model can "transfer rich multi-scale texture patterns from the source image distribution to the noise prediction".

However, in the code, I find that only the last-layer feature of the encoder is used for cross attention, as the [-1] index shows:
pose_out = self.cros_attn2(x = xt_feats[-1], cond = pose_feats[-1]).mean([2,3])

Could you please briefly tell me where the "multi-scale" feature for cross attention is implemented?
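To make the question concrete, here is a minimal sketch of the distinction I have in mind. It is plain Python and not the repository's code: `toy_cross_attention`, `last_scale_only`, and `all_scales` are hypothetical names, each feature level is reduced to a flat list of scalars, and I am assuming the encoder returns one feature map per resolution level.

```python
import math

def toy_cross_attention(x, cond):
    """Hypothetical stand-in for a cross-attention layer (NOT the repo's cros_attn2).

    x and cond are flat lists of scalar features, one per spatial position.
    Every query position in x attends over all positions in cond.
    """
    out = []
    for q in x:
        scores = [q * k for k in cond]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, cond)) / total)
    return out

def spatial_mean(values):
    # Plays the role of .mean([2, 3]) in the quoted line: average over positions.
    return sum(values) / len(values)

def last_scale_only(xt_feats, pose_feats):
    # Mirrors the quoted line: only index [-1] of each feature list is used.
    return [spatial_mean(toy_cross_attention(xt_feats[-1], pose_feats[-1]))]

def all_scales(xt_feats, pose_feats):
    # What I would expect from "multi-scale" conditioning: one output per level.
    return [
        spatial_mean(toy_cross_attention(x, c))
        for x, c in zip(xt_feats, pose_feats)
    ]

# Toy feature pyramids with 16, 4, and 1 spatial positions (assumed layout).
xt_feats = [[0.1 * i for i in range(16)], [0.5 * i for i in range(4)], [1.0]]
pose_feats = [[0.2 * i for i in range(16)], [0.3 * i for i in range(4)], [2.0]]

single = last_scale_only(xt_feats, pose_feats)
multi = all_scales(xt_feats, pose_feats)
print(len(single), len(multi))
```

With the [-1] indexing, the condition contributes one pooled vector from the coarsest level only, whereas the multi-scale variant yields one per level. The last entry of `multi` coincides with `single`, which is why I suspect the quoted line alone cannot be the multi-scale path.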

Well, I think the actual main model is the class "BeatGANsAutoencModel" rather than the class "BeatGANsPoseGuideModel", and the multi-scale condition features are stored in the variables "enc_cond_emb", "mid_cond_emb", and "dec_cond_emb". Is that right?