d-ailin/GDN

question about attention coefficients

Closed this issue · 2 comments

Hello, and thank you for your work and code. I have the following questions:
(1) In graph_layer.py, we have:
alpha = softmax(alpha, edge_index_i, size_i)
According to equation (8) in your paper, the attention coefficients should be normalized through softmax for each node i over all its neighbour nodes j where there exists an edge i-->j.
However, in the code above, I think the attention coefficients are normalized through softmax for each node i over all its neighbour nodes j where there exists an edge j-->i, since according to the torch_geometric documentation, edge_index_i stands for the index of the central node that aggregates information and x_j is the source node, and we finally get:
return x_j * alpha.view(-1, self.heads, 1)
which stands for the information vector of each source node that should be passed to the central node.
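To illustrate what I mean, here is a minimal toy sketch (not the repo's code; the graph and scores are made up) of how torch_geometric.utils.softmax behaves when edge_index_i is passed as the index: the normalization runs over each target node's incoming edges.

import torch
from torch_geometric.utils import softmax

# hypothetical toy graph: edges 0->2, 1->2, 0->1 (source -> target)
edge_index = torch.tensor([[0, 1, 0],    # source nodes j
                           [2, 2, 1]])   # target nodes i
alpha = torch.tensor([1.0, 2.0, 0.5])    # one raw attention score per edge

edge_index_i = edge_index[1]             # target indices, as in graph_layer.py
alpha = softmax(alpha, edge_index_i)     # normalized per target node
# the two edges into node 2 (scores 1.0 and 2.0) now sum to 1;
# the single edge into node 1 gets weight 1.0
print(alpha)                             # tensor([0.2689, 0.7311, 1.0000])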
(2) In your paper, for the output layer, we have:
use the results across all nodes as the input of stacked fully-connected layers with output dimensionality N, to predict the vector of sensor values at time step t.
However, in your code, I think the output layer, in the form of an MLP with output dimensionality 1, is shared across all nodes. Is this inconsistent with the paper, or am I misunderstanding?
I would be grateful for your reply.

Thanks for your interest.

  1. In the paper, the neighbor nodes are defined based on the existence of an edge from j to i; see the definition of \mathcal{N}(i) (below eq. (5)) and A_{ji} (eq. (3)). So the paper and the implementation should be consistent.

  2. The paper means that, assuming we have the embedding outputs from the previous layers with shape N * D, where N is the number of sensors and D is the embedding size (this can also be taken as the input to eq. (9)), passing them through the MLP yields an output with shape N * 1, which represents the N sensors' predictions. The output dimensionality N here is to emphasize that the output corresponds to the predictions of all N sensors.
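For concreteness, a minimal sketch of this point (with hypothetical shapes and layer sizes, not the exact model code): a single MLP with output dimensionality 1 is shared across sensors, so applying it to an N * D embedding matrix yields one prediction per sensor, i.e. an output of dimensionality N overall.

import torch
import torch.nn as nn

N, D = 27, 64                      # e.g. 27 sensors, embedding size 64 (illustrative only)
z = torch.randn(N, D)              # per-sensor embeddings from the previous layers

shared_mlp = nn.Sequential(        # weights shared across all sensors; per-sensor output dim is 1
    nn.Linear(D, D),
    nn.ReLU(),
    nn.Linear(D, 1),
)

pred = shared_mlp(z).squeeze(-1)   # shape (N,): one predicted value for each of the N sensors
print(pred.shape)                  # torch.Size([27])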

Hope it clarifies :)

Thank you for your kind reply!