qitianwu/NodeFormer

What is the principle of exchanging the first two dimensions when calculating QKV attention?

WithMeteor opened this issue · 1 comment

When reading the source code of NodeFormer, I noticed that when computing QKV attention, the first and second dimensions of query/key/value are swapped, e.g. at lines 169-171 of nodeformer.py. After the attention is computed, the first two dimensions are swapped back before normalization.
At first I thought this step was unnecessary, until I commented out the code and ran into an out-of-memory error. So I am very curious about the principle behind this step.
Does placing node_number in the second dimension affect the complexity of the matrix multiplication when computing the dot product of key and value? Is that why node_number is moved to the first dimension beforehand?
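For reference, here is a minimal sketch of linear (kernelized) attention with the node dimension permuted to the front. The shapes, variable names, and einsum equations are assumptions for illustration, not the exact NodeFormer code: B = batch, N = nodes, H = heads, M = feature-map dim, D = value dim. The point is that the subscript labeling the node axis in the einsum has to match the tensor layout, so the permute puts nodes where the equation expects them.

```python
import torch

B, N, H, M, D = 2, 1000, 4, 16, 16
query_prime = torch.rand(B, N, H, M)   # assumed original layout [B, N, H, M]
key_prime   = torch.rand(B, N, H, M)
value       = torch.rand(B, N, H, D)

# Move the node dimension to the front: [N, B, H, *]
q = query_prime.permute(1, 0, 2, 3)
k = key_prime.permute(1, 0, 2, 3)
v = value.permute(1, 0, 2, 3)

# Linear attention: aggregate key/value over nodes first (O(N*M*D) rather than O(N^2)),
# then multiply by the queries. 'n' must label the node axis for this to be correct.
kv = torch.einsum("nbhm,nbhd->bhmd", k, v)           # [B, H, M, D]
attn_num = torch.einsum("nbhm,bhmd->nbhd", q, kv)    # [N, B, H, D]

k_sum = k.sum(dim=0)                                        # [B, H, M]
attn_normalizer = torch.einsum("nbhm,bhm->nbh", q, k_sum)   # [N, B, H]
out = attn_num / attn_normalizer.unsqueeze(-1)              # [N, B, H, D]

# Swap the first two dimensions back for downstream layers: [B, N, H, D]
out = out.permute(1, 0, 2, 3)
print(out.shape)
```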

I think I have found the cause of the problem. When computing the weights of the adjacency matrix, the slicing dimension has to be adjusted when obtaining query_end and key_start. Changing query_prime[end] to query_prime[:, end] and key_prime[start] to key_prime[:, start] at lines 143 and 188, and changing attn_normalizer[end] to attn_normalizer[:, end] at lines 147 and 192, solves the problem.
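The sketch below illustrates why the slicing changes if the permute is removed; it is not the exact NodeFormer code. edge_index follows the PyTorch Geometric [2, E] convention, and names such as query_prime and attn_normalizer mirror the issue, with shapes assumed as above. With nodes in the first dimension, plain indexing like query_prime[end] gathers edge endpoints; with nodes in the second dimension, the slice must explicitly target dim 1, as suggested above.

```python
import torch

B, N, H, M, E = 2, 100, 4, 16, 300
edge_index = torch.randint(0, N, (2, E))
start, end = edge_index[0], edge_index[1]

# Case 1: node dimension first (permuted layout [N, B, H, M]).
query_prime = torch.rand(N, B, H, M)
key_prime   = torch.rand(N, B, H, M)
attn_normalizer = torch.rand(N, B, H)
query_end, key_start = query_prime[end], key_prime[start]         # gathers nodes: [E, B, H, M]
edge_attn_num = torch.einsum("ebhm,ebhm->ebh", query_end, key_start)
edge_attn = edge_attn_num / attn_normalizer[end]                   # [E, B, H]

# Case 2: node dimension second (layout [B, N, H, M]). Plain indexing would hit the
# batch axis (and index out of bounds for node ids >= B), so slice along dim 1.
query_prime2 = torch.rand(B, N, H, M)
key_prime2   = torch.rand(B, N, H, M)
attn_normalizer2 = torch.rand(B, N, H)
query_end2, key_start2 = query_prime2[:, end], key_prime2[:, start]   # [B, E, H, M]
edge_attn_num2 = torch.einsum("behm,behm->beh", query_end2, key_start2)
edge_attn2 = edge_attn_num2 / attn_normalizer2[:, end]                # [B, E, H]
```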
This issue will be closed.