Question about k × hk weight matrices
ramesaliyev opened this issue · 1 comment
Hi, first of all, thanks for this great explanation.
My question is: in the blog post, under the section "In Pytorch: complete self-attention", it says
but it's actually more efficient to combine these for all heads into three single k×hk matrices
Shouldn't it be three hk × k matrices, since this is what the weights of Linear layers look like?
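For reference, a quick way to see the shape in question — a minimal sketch, assuming the combined projection is built with nn.Linear(k, heads * k) as in the post, and hypothetical sizes k=4, heads=8:

```python
import torch.nn as nn

k, heads = 4, 8                               # hypothetical sizes
tokeys = nn.Linear(k, heads * k, bias=False)  # combined key projection for all heads

# PyTorch stores Linear weights as (out_features, in_features),
# so this prints torch.Size([32, 4]), i.e. hk x k rather than k x hk.
print(tokeys.weight.shape)
```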
Thanks
For the tutorial I went with row vectors for the instances and thus input_size × output_size weight matrices. This is what you see in the figures, so it makes more sense to write k × hk.
It's possible that the Linear layer stores the weights as output_size × input_size. That would be correct for pre-multiplying a column vector (but it would be inconsistent with the figures in the blog post).
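To make the two conventions concrete, here is a minimal sketch with hypothetical sizes k=4, h=2: the Linear layer stores an hk × k matrix, but it applies the transpose to row-vector inputs, so the matrix that effectively post-multiplies the instances is the k × hk matrix shown in the figures.

```python
import torch
import torch.nn as nn

k, h = 4, 2                                  # hypothetical sizes
to_queries = nn.Linear(k, h * k, bias=False)

print(to_queries.weight.shape)               # torch.Size([8, 4]): stored as hk x k

x = torch.randn(3, k)                        # three instances as row vectors
W = to_queries.weight.T                      # the k x hk matrix from the figures

# nn.Linear computes x @ weight.T, so the row vectors are post-multiplied
# by the k x hk matrix W, matching the blog post's notation.
assert torch.allclose(to_queries(x), x @ W)
```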