lucidrains/muse-maskgit-pytorch

Why is the scale used in Attention 8, while dim_head is 64? If dim or dim_head is changed, should the scale change automatically?

lqniunjunlper opened this issue · 1 comment


In `attend.py`, line #123:

```python
sim = einsum("b h i d, b h j d -> b h i j", q, k) * self.scale
```
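
For context on the question's premise: the classic scaled-dot-product convention sets the scale to `dim_head ** -0.5` (which would be 1/8 for `dim_head = 64`, not 8). A fixed scale of 8 is the pattern used in cosine-sim attention, where q and k are l2-normalized first and the scale acts as a temperature that does not depend on `dim_head`. Below is a minimal sketch contrasting the two conventions; it assumes PyTorch, and the function names are hypothetical, not taken from the repo:

```python
import torch
import torch.nn.functional as F
from torch import einsum

def standard_attention_scores(q, k, dim_head=64):
    # classic scaled dot-product attention:
    # scale = dim_head ** -0.5, i.e. 1/8 when dim_head = 64,
    # so it must track dim_head if that changes
    scale = dim_head ** -0.5
    return einsum("b h i d, b h j d -> b h i j", q, k) * scale

def cosine_sim_attention_scores(q, k, scale=8.0):
    # cosine-sim attention: l2-normalize q and k along the head
    # dimension, so each dot product is a cosine similarity in [-1, 1];
    # the fixed scale is then a temperature independent of dim_head
    q, k = map(lambda t: F.normalize(t, dim=-1), (q, k))
    return einsum("b h i d, b h j d -> b h i j", q, k) * scale
```

If the repo's Attention does normalize q and k before the einsum quoted above, the dot products are already bounded regardless of `dim_head`, which would explain why the scale is a fixed constant rather than something that changes automatically with `dim` or `dim_head`.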