pbloem/former

Question about k × hk weight matrices

ramesaliyev opened this issue · 1 comment

Hi, first of all, thanks for this great explanation.

My question is about the blog post: under the section In Pytorch: complete self-attention it says

but it's actually more efficient to combine these for all heads into three single k×hk matrices

Shouldn't it be three hk × k matrices, since that is what the weights of Linear layers look like?
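For example, with made-up sizes k = 4 and h = 2, the weight of an nn.Linear(k, h * k) layer comes out as hk × k:

```python
import torch.nn as nn

k, h = 4, 2                               # made-up embedding size and number of heads
tokeys = nn.Linear(k, h * k, bias=False)  # a combined projection for h heads
print(tokeys.weight.shape)                # torch.Size([8, 4]), i.e. hk x k
```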
Thanks

pbloem commented

For the tutorial I went with row vectors for the instances, and thus input_size × output_size weight matrices. This is what you see in the figures, so it makes more sense to write k × hk.
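Something like this (sizes made up) is the multiplication I mean, with the instances as rows of x and a plain k × hk projection multiplied on the right:

```python
import torch

t, k, h = 3, 4, 2                  # made-up sequence length, embedding size, heads
x = torch.randn(t, k)              # instances as row vectors, one per row of x
w_queries = torch.randn(k, h * k)  # combined query projection, k x hk

queries = x @ w_queries            # (t, k) @ (k, hk) -> (t, hk)
print(queries.shape)               # torch.Size([3, 8])
```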

The Linear layer does store its weights as output_size × input_size, i.e. hk × k. That is the shape you would need to multiply a column vector from the left (Wx), which is why it looks inconsistent with the figures in the blog post, but the layer applies the transpose of the stored weight, so the effective map is still k × hk.
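A quick sketch (again with made-up sizes): the stored weight is hk × k, but the layer multiplies the input by its transpose, so the result matches the k × hk multiplication above:

```python
import torch
import torch.nn as nn

t, k, h = 3, 4, 2
x = torch.randn(t, k)
toqueries = nn.Linear(k, h * k, bias=False)

print(toqueries.weight.shape)                 # torch.Size([8, 4]): stored as hk x k
out_layer = toqueries(x)                      # what the layer actually computes
out_manual = x @ toqueries.weight.t()         # same thing, written as x times a k x hk matrix
print(torch.allclose(out_layer, out_manual))  # True
```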