lucidrains/DALLE2-pytorch

some questions about scaling

miganchuanbo opened this issue · 0 comments

```python
def forward(self, x, mask = None, attn_bias = None):
```

It seems we need to scale Q and K when using cosine-sim attention. But what is the reason for scaling Q before applying the rotary embeddings?
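For context, here is a minimal sketch of the cosine-sim attention scaling being asked about: Q and K are l2-normalized so their dot products become cosine similarities in [-1, 1], and a temperature (`scale`, a fixed value here, though it can be learned) restores a usable logit range before the softmax. The function name and the fixed `scale=10.0` are illustrative assumptions, not the repo's exact implementation.

```python
import torch
import torch.nn.functional as F

def cosine_sim_attention(q, k, v, scale = 10.0):
    # l2-normalize queries and keys so q @ k^T is a cosine similarity
    q = F.normalize(q, dim = -1)
    k = F.normalize(k, dim = -1)

    # cosine similarities lie in [-1, 1]; multiply by a temperature
    # (fixed here for illustration, often learned in practice) so the
    # softmax is not overly flat
    sim = (q @ k.transpose(-2, -1)) * scale

    attn = sim.softmax(dim = -1)
    return attn @ v

# toy usage: (batch, seq_len, dim_head)
q = torch.randn(1, 4, 8)
k = torch.randn(1, 4, 8)
v = torch.randn(1, 4, 8)

out = cosine_sim_attention(q, k, v)
print(out.shape)  # torch.Size([1, 4, 8])
```

Note that a rotary embedding is a pure rotation and so preserves vector norms, which is relevant to how it interacts with the l2 normalization above.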