pbloem/former

Question about k × hk weight matrices

ramesaliyev opened this issue · 1 comment

Hi, first of all, thanks for this great explanation.

My question is about the blog post: under the section In Pytorch: complete self-attention it says

but it's actually more efficient to combine these for all heads into three single k×hk matrices

Shouldn't it be three hk × k matrices, since that is what the weights of Linear layers look like?
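For example, with made-up sizes k = 4 and h = 2, the weight of an nn.Linear(k, h * k) layer comes out as hk × k:

```python
import torch.nn as nn

k, h = 4, 2                               # made-up embedding size and number of heads
tokeys = nn.Linear(k, h * k, bias=False)  # a combined projection for h heads
print(tokeys.weight.shape)                # torch.Size([8, 4]), i.e. hk x k
```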
Thanks

pbloem commented

For the tutorial I went with row vectors for the instances, and thus input_size × output_size weight matrices. This is what you see in the figures, so it makes more sense to write k × hk.
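Something like this (sizes made up) is the multiplication I mean, with the instances as rows of x and a plain k × hk projection multiplied on the right:

```python
import torch

t, k, h = 3, 4, 2                  # made-up sequence length, embedding size, heads
x = torch.randn(t, k)              # instances as row vectors, one per row of x
w_queries = torch.randn(k, h * k)  # combined query projection, k x hk

queries = x @ w_queries            # (t, k) @ (k, hk) -> (t, hk)
print(queries.shape)               # torch.Size([3, 8])
```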

The Linear layer does store its weights as output_size × input_size, i.e. hk × k. That is the shape you would need to multiply a column vector from the left (Wx), which is why it looks inconsistent with the figures in the blog post, but the layer applies the transpose of the stored weight, so the effective map is still k × hk.
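A quick sketch (again with made-up sizes): the stored weight is hk × k, but the layer multiplies the input by its transpose, so the result matches the k × hk multiplication above:

```python
import torch
import torch.nn as nn

t, k, h = 3, 4, 2
x = torch.randn(t, k)
toqueries = nn.Linear(k, h * k, bias=False)

print(toqueries.weight.shape)                 # torch.Size([8, 4]): stored as hk x k
out_layer = toqueries(x)                      # what the layer actually computes
out_manual = x @ toqueries.weight.t()         # same thing, written as x times a k x hk matrix
print(torch.allclose(out_layer, out_manual))  # True
```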