d-ailin/GDN

Vector of learned coefficients for the attention mechanism and question about edge_index

DevBySam7 opened this issue · 4 comments

Hello, first of all, thank you very much for making this amazing work public! The idea and the results are really impressive. Here are my questions:

  1. Is formula (7) of the paper right? As far as I understand, alpha = (key_i * cat_att_i).sum(-1) + (key_j * cat_att_j).sum(-1) in line 103 of the graph_layer doesn't reflect formula (7) (even leaving the LeakyReLU aside).

  2. What is the idea behind creating the edge_index with TopK inside the GDN module rather than outside with a separate function?

Thanks for your interest, too.

  1. The activation function is applied just below line 103:
    alpha = F.leaky_relu(alpha, self.negative_slope)
  2. I don't really get the question here. If you are referring to the code implementation, the TopK selection is already in the GDN module:
    gated_i = torch.arange(0, node_num).T.unsqueeze(1).repeat(1, topk_num).flatten().to(device).unsqueeze(0)
    If not, could you elaborate in more detail? Thanks. (A condensed sketch of this construction follows this reply.)
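
For context, here is a condensed, self-contained sketch of the surrounding logic in GDN's forward pass: cosine similarity between the learned sensor embeddings, TopK neighbour selection per node, then the `gated_i`/`gated_j` index construction. The variable names follow the repo, but treat this as an illustration rather than the exact source.

```python
import torch

node_num, embed_dim, topk_num = 5, 16, 2
device = 'cpu'

# Learned sensor embeddings (v_i in the paper); random here for illustration.
embeddings = torch.randn(node_num, embed_dim)

# Cosine similarity between every pair of sensor embeddings.
weights = embeddings.detach()
cos_ji_mat = weights @ weights.T
norms = weights.norm(dim=-1)
cos_ji_mat = cos_ji_mat / (norms.view(-1, 1) @ norms.view(1, -1))

# For each node i, keep the indices of its topk_num most similar nodes j.
topk_indices_ji = torch.topk(cos_ji_mat, topk_num, dim=-1)[1]

# Source side: each i repeated topk_num times (the quoted line above;
# note the .T there is a no-op on a 1-D tensor, so it is dropped here).
gated_i = torch.arange(0, node_num).unsqueeze(1).repeat(1, topk_num) \
    .flatten().to(device).unsqueeze(0)
# Target side: the selected neighbours, flattened in the same row order.
gated_j = topk_indices_ji.flatten().unsqueeze(0)

gated_edge_index = torch.cat((gated_j, gated_i), dim=0)  # shape (2, node_num * topk_num)
```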

Hello, I appreciate your fast answers. Sorry if I was unclear; I'll try to reformulate my questions.

  1. If I compare the paper with the code, I understand:
    a^T (g_i^(t) ⊕ g_j^(t)) = alpha = (key_i * cat_att_i).sum(-1) + (key_j * cat_att_j).sum(-1).
    I was just questioning whether this mathematical formulation is really captured by formula (7), since I don't quite see it, or whether the formula in the paper is slightly incorrect.

  2. This question was more about the idea of why the TopK was placed inside the GDN. I'm not sure, but I think it would make a bit more sense to return the embeddings from the GDN model at each epoch, calculate a new edge_index from them, and then feed them into the GDN again. But don't worry, it is really hard to formulate and I was just curious :). The first question is far more important to me.

Greetings :)

  1. The implementation just decomposes the linear combination into two parts: instead of concatenating first and then applying the linear combination as in formula (7), we split the learned vector a into the part acting on g_i (key_i) and the part acting on g_j (key_j), apply the linear combination to each part, and sum the results. Since a^T (g_i ⊕ g_j) = a_i^T g_i + a_j^T g_j when a = a_i ⊕ a_j, the implementation means exactly the same as formula (7); it is still the same linear combination. (See the numerical check after this list.)
  2. If I understand correctly, you are suggesting that the edge_index should be rebuilt from the output embedding of GDN and that this new edge_index should then be used for the final embedding computation. That sounds like a two-layer GDN, where the first layer uses the current edge_index and the second layer uses the new edge_index derived from the first layer's output. I think it is plausible intuitively :). (A rough sketch follows below.)
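
To make the equivalence in point 1 concrete, here is a minimal numerical check (not repo code; `att_i` and `att_j` stand in for the two halves of the learned vector a, i.e. `cat_att_i`/`cat_att_j` in graph_layer.py):

```python
import torch

# Check that splitting the attention vector is equivalent to formula (7):
# a^T (g_i ⊕ g_j) == a_i^T g_i + a_j^T g_j  when  a = a_i ⊕ a_j.
dim = 8
g_i, g_j = torch.randn(dim), torch.randn(dim)
att_i, att_j = torch.randn(dim), torch.randn(dim)  # the two halves of a

concat_version = torch.cat([att_i, att_j]) @ torch.cat([g_i, g_j])  # a^T (g_i ⊕ g_j)
split_version = (g_i * att_i).sum(-1) + (g_j * att_j).sum(-1)       # as in line 103

assert torch.allclose(concat_version, split_version)
```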
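
And for point 2, a rough sketch of the idea, with `nn.Linear` as a stand-in for GDN's actual GNN layers (everything here is hypothetical, just to illustrate rebuilding the edge_index from intermediate embeddings):

```python
import torch
import torch.nn as nn

def build_topk_edge_index(embeddings: torch.Tensor, topk_num: int) -> torch.Tensor:
    # Same cosine-similarity TopK construction as in the sketch above.
    w = embeddings.detach()
    cos = (w @ w.T) / (w.norm(dim=-1).view(-1, 1) @ w.norm(dim=-1).view(1, -1))
    j = torch.topk(cos, topk_num, dim=-1)[1].flatten().unsqueeze(0)
    i = torch.arange(w.size(0)).unsqueeze(1).repeat(1, topk_num).flatten().unsqueeze(0)
    return torch.cat((j, i), dim=0)

# nn.Linear is only a placeholder; the real layers would be GDN's GNNLayer
# and would actually consume the edge_index.
layer1, layer2 = nn.Linear(16, 16), nn.Linear(16, 16)

x = torch.randn(5, 16)                       # 5 sensors, 16-dim features
edge_index_1 = build_topk_edge_index(x, 2)   # graph for the first layer

h1 = layer1(x)                               # first-layer output embeddings
edge_index_2 = build_topk_edge_index(h1, 2)  # new graph built from h1
h2 = layer2(h1)                              # second layer would use edge_index_2
```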

Alright, this answers my questions. Thanks a lot for taking the time!