lucidrains/DALLE2-pytorch

some questions about scaling

miganchuanbo opened this issue · 0 comments

```python
def forward(self, x, mask = None, attn_bias = None):
```

It seems we need to scale Q and K when using cosine-sim attention. But what is the reason for scaling Q before applying the rotary embeddings?
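For context, here is a minimal sketch of the cosine-sim attention scaling being asked about: Q and K are l2-normalized so their dot products become cosine similarities in [-1, 1], and a temperature (`scale`, a fixed value here, though it can be learned) restores a usable logit range before the softmax. The function name and the fixed `scale=10.0` are illustrative assumptions, not the repo's exact implementation.

```python
import torch
import torch.nn.functional as F

def cosine_sim_attention(q, k, v, scale = 10.0):
    # l2-normalize queries and keys so q @ k^T is a cosine similarity
    q = F.normalize(q, dim = -1)
    k = F.normalize(k, dim = -1)

    # cosine similarities lie in [-1, 1]; multiply by a temperature
    # (fixed here for illustration, often learned in practice) so the
    # softmax is not overly flat
    sim = (q @ k.transpose(-2, -1)) * scale

    attn = sim.softmax(dim = -1)
    return attn @ v

# toy usage: (batch, seq_len, dim_head)
q = torch.randn(1, 4, 8)
k = torch.randn(1, 4, 8)
v = torch.randn(1, 4, 8)

out = cosine_sim_attention(q, k, v)
print(out.shape)  # torch.Size([1, 4, 8])
```

Note that a rotary embedding is a pure rotation and so preserves vector norms, which is relevant to how it interacts with the l2 normalization above.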