qitianwu/NodeFormer

What is the principle of exchanging the first two dimensions when calculating QKV attention?

WithMeteor opened this issue · 1 comment

When reading the source code of NodeFormer, I noticed that when computing QKV attention, the first and second dimensions of query/key/value are swapped, e.g. at lines 169-171 of nodeformer.py. After the attention is computed, the first two dimensions are swapped back before normalization.
At first I thought this step was unnecessary, until I commented out the code and ran into an out-of-memory error. So I am very curious about the principle behind this step.
Does placing node_number in the second dimension affect the complexity of the matrix multiplication when computing the dot product of key and value? Is that why node_number is moved to the first dimension beforehand?
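For reference, here is a minimal sketch of linear (kernelized) attention with the node dimension permuted to the front. The shapes, variable names, and einsum equations are assumptions for illustration, not the exact NodeFormer code: B = batch, N = nodes, H = heads, M = feature-map dim, D = value dim. The point is that the subscript labeling the node axis in the einsum has to match the tensor layout, so the permute puts nodes where the equation expects them.

```python
import torch

B, N, H, M, D = 2, 1000, 4, 16, 16
query_prime = torch.rand(B, N, H, M)   # assumed original layout [B, N, H, M]
key_prime   = torch.rand(B, N, H, M)
value       = torch.rand(B, N, H, D)

# Move the node dimension to the front: [N, B, H, *]
q = query_prime.permute(1, 0, 2, 3)
k = key_prime.permute(1, 0, 2, 3)
v = value.permute(1, 0, 2, 3)

# Linear attention: aggregate key/value over nodes first (O(N*M*D) rather than O(N^2)),
# then multiply by the queries. 'n' must label the node axis for this to be correct.
kv = torch.einsum("nbhm,nbhd->bhmd", k, v)           # [B, H, M, D]
attn_num = torch.einsum("nbhm,bhmd->nbhd", q, kv)    # [N, B, H, D]

k_sum = k.sum(dim=0)                                        # [B, H, M]
attn_normalizer = torch.einsum("nbhm,bhm->nbh", q, k_sum)   # [N, B, H]
out = attn_num / attn_normalizer.unsqueeze(-1)              # [N, B, H, D]

# Swap the first two dimensions back for downstream layers: [B, N, H, D]
out = out.permute(1, 0, 2, 3)
print(out.shape)
```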

I think I have found the cause of the problem. When computing the weights of the adjacency matrix, the slicing dimension has to be adjusted when obtaining query_end and key_start. Changing query_prime[end] to query_prime[:, end] and key_prime[start] to key_prime[:, start] at lines 143 and 188, and changing attn_normalizer[end] to attn_normalizer[:, end] at lines 147 and 192, solves the problem.
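The sketch below illustrates why the slicing changes if the permute is removed; it is not the exact NodeFormer code. edge_index follows the PyTorch Geometric [2, E] convention, and names such as query_prime and attn_normalizer mirror the issue, with shapes assumed as above. With nodes in the first dimension, plain indexing like query_prime[end] gathers edge endpoints; with nodes in the second dimension, the slice must explicitly target dim 1, as suggested above.

```python
import torch

B, N, H, M, E = 2, 100, 4, 16, 300
edge_index = torch.randint(0, N, (2, E))
start, end = edge_index[0], edge_index[1]

# Case 1: node dimension first (permuted layout [N, B, H, M]).
query_prime = torch.rand(N, B, H, M)
key_prime   = torch.rand(N, B, H, M)
attn_normalizer = torch.rand(N, B, H)
query_end, key_start = query_prime[end], key_prime[start]         # gathers nodes: [E, B, H, M]
edge_attn_num = torch.einsum("ebhm,ebhm->ebh", query_end, key_start)
edge_attn = edge_attn_num / attn_normalizer[end]                   # [E, B, H]

# Case 2: node dimension second (layout [B, N, H, M]). Plain indexing would hit the
# batch axis (and index out of bounds for node ids >= B), so slice along dim 1.
query_prime2 = torch.rand(B, N, H, M)
key_prime2   = torch.rand(B, N, H, M)
attn_normalizer2 = torch.rand(B, N, H)
query_end2, key_start2 = query_prime2[:, end], key_prime2[:, start]   # [B, E, H, M]
edge_attn_num2 = torch.einsum("behm,behm->beh", query_end2, key_start2)
edge_attn2 = edge_attn_num2 / attn_normalizer2[:, end]                # [B, E, H]
```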
This issue will be closed.