leffff/graphormer-pyg

Optimize for loop in SpatialEncoding

Closed this issue · 8 comments

leffff commented

class SpatialEncoding(nn.Module):

/opt/conda/conda-bld/pytorch_1670525553989/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [110,0,0], thread: [48,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1670525553989/work/aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [110,0,0], thread: [49,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.

spatial_matrix = torch.zeros((x.shape[0], x.shape[0])).to(next(self.parameters()).device)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
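
As the message suggests, the failing line can be located by setting CUDA_LAUNCH_BLOCKING before torch initializes CUDA. A minimal sketch of how to do that from inside a script:

import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before the first CUDA call
import torch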

When I switch the device to CPU, the error is:
Traceback (most recent call last):
File "/home/lizheng/anaconda3/envs/nas/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/models.py", line 625, in forward
x = self.centrality_encoding(x, edge_index)
File "/home/lizheng/anaconda3/envs/nas/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
return forward_call(*input, **kwargs)
File "/home/layers.py", line 29, in forward
x += self.z_in[degree(index=edge_index[1], num_nodes=num_nodes).long()] +
IndexError: index 3 is out of bounds for dimension 0 with size 3

Thank you very much for providing the code.

I just found that if I use a batch as input, the code treats the batch as one large graph, which leads to the problem above.

x += self.z_in[degree(index=edge_index[1], num_nodes=num_nodes).long()] +
self.z_out[degree(index=edge_index[0], num_nodes=num_nodes).long()]

The DataBatch looks like:
DataBatch(x=[14168, 6], edge_index=[2, 35352], edge_attr=[35352, 6], batch=[14168], ptr=[2025])

With this batching, edge_index contains node indices as large as edge_index[1] = 10000, which exceed the bounds of the array. How can I solve this issue?

By the way, for certain reasons I can only obtain the adjacency matrix and the node feature matrix, so I used the following code to convert them into torch_geometric data:

import scipy.sparse
import torch
from torch_geometric.data import Data, Batch
from torch_geometric.utils import from_scipy_sparse_matrix

batch_size, num_nodes, _ = adj.shape
data_list = []
for batch_idx in range(batch_size):
    # convert one dense adjacency matrix to a sparse edge_index
    adj_sp = scipy.sparse.csr_matrix(adj[batch_idx].numpy())
    edge_index, edge_attr = from_scipy_sparse_matrix(adj_sp)
    edge_attr = torch.ones(edge_attr.shape[0], 6)
    data = Data(x=ops[batch_idx], edge_index=edge_index, edge_attr=edge_attr)
    data_list.append(data)
batch = Batch.from_data_list(data_list)
data = batch.to(device)
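
A minimal sketch of an alternative conversion, assuming adj[batch_idx] is a dense adjacency tensor, using torch_geometric.utils.dense_to_sparse to avoid the detour through scipy (the remaining names match the snippet above):

import torch
from torch_geometric.data import Data, Batch
from torch_geometric.utils import dense_to_sparse

data_list = []
for batch_idx in range(adj.shape[0]):
    # dense_to_sparse returns (edge_index, edge_values) for a dense adjacency matrix
    edge_index, _ = dense_to_sparse(adj[batch_idx])
    edge_attr = torch.ones(edge_index.shape[1], 6)  # placeholder one-hot edge types
    data_list.append(Data(x=ops[batch_idx], edge_index=edge_index, edge_attr=edge_attr))

batch = Batch.from_data_list(data_list).to(device)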

leffff commented

Hi!
I didn't quite understand what you are trying to do and what the issue is. Could you please provide more information?

PyG neural networks treat a batch of graphs as one large graph, so the node feature matrix x has shape [num_nodes_in_batch, num_node_features]. Therefore x only has 2 dimensions.

Let's dive into:

x += self.z_in[degree(index=edge_index[1], num_nodes=num_nodes).long()] + 
self.z_out[degree(index=edge_index[0], num_nodes=num_nodes).long()]

edge_index is a matrix of shape [2, num_edges]. edge_index[0] is an array that contains the "from" node indices and edge_index[1] is an array that contains the "to" node indices.

So degree(index=edge_index[1], num_nodes=num_nodes).long() computes the "in" degrees of the nodes, and degree(index=edge_index[0], num_nodes=num_nodes).long() computes the "out" degrees of the nodes.
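
As a small illustration of these two calls (the toy graph below is made up just for this sketch):

import torch
from torch_geometric.utils import degree

# toy graph with 3 nodes and directed edges 0->1, 0->2, 1->2
edge_index = torch.tensor([[0, 0, 1],
                           [1, 2, 2]])

in_deg = degree(index=edge_index[1], num_nodes=3).long()   # tensor([0, 1, 2])
out_deg = degree(index=edge_index[0], num_nodes=3).long()  # tensor([2, 1, 0])

# These degrees are then used as indices into the z_in / z_out embedding tables,
# so every degree must be smaller than max_in_degree / max_out_degree.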

Thanks a lot for getting back to me!

Previously, I used GIN in my code, but I would like to use Graphormer instead of GIN to encode the graph. For certain reasons I can't use data that is already in torch_geometric format; all I have are the adjacency matrices of the graphs and a one-hot vector per edge (used as edge_attr) that encodes each edge's type. Therefore, I need to transform the adjacency matrices and edge_attr into torch_geometric data, so I wrote the following code:

for step, (adj, ops, _, _) in enumerate(train_loader):
    batch_size, num_nodes, _ = adj.shape
    data_list = []
    for batch_idx in range(batch_size):
        adj_sp = scipy.sparse.csr_matrix(adj[batch_idx].numpy())
        edge_index, edge_attr = from_scipy_sparse_matrix(adj_sp)
        edge_attr = torch.ones(edge_attr.shape[0], 6)
        data = Data(x=ops[batch_idx], edge_index=edge_index, edge_attr=edge_attr)
        data_list.append(data)
    batch = Batch.from_data_list(data_list)
    data = batch.to(device)

However, torch_geometric handles a batch by merging the graphs into one large disconnected graph. Because of that, when the code runs to this line:

x += self.z_in[degree(index=edge_index[1], num_nodes=num_nodes).long()] +
self.z_out[degree(index=edge_index[0], num_nodes=num_nodes).long()]

and the model is defined with the following hyperparameters:

self.encoder = Graphormer(num_layers = self.layer_num,
              input_node_dim = 6,
              node_dim = 6,
              input_edge_dim = 6,
              edge_dim = 6,
              output_dim = 3584,
              n_heads = 4,
              max_in_degree = 5,
              max_out_degree = 5,
              max_path_distance = 5,)

Because max_in_degree = 5, while edge_index[0] and edge_index[1] look like this:

edge_index[0] -> tensor([    0,     0,     0,  ..., 14166, 14167, 14167])
edge_index[1] -> tensor([    1,     3,     4,  ..., 14167, 14164, 14166])

this leads to the following error:
x += self.z_in[degree(index=edge_index[1], num_nodes=num_nodes).long()] +
IndexError: index 5 is out of bounds for dimension 0 with size 5

So how can I fix this issue?

leffff commented

I may know where the problem is.
I made a new commit be28c70; it may solve your problem. Please check.
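
In the meantime, one workaround you could also try is to clamp the degrees before indexing, so they always stay inside the embedding tables. A minimal sketch of the idea inside the centrality encoding's forward (the attribute names max_in_degree / max_out_degree are assumed here for illustration):

in_deg = degree(index=edge_index[1], num_nodes=num_nodes).long()
out_deg = degree(index=edge_index[0], num_nodes=num_nodes).long()

# cap the degrees so that indexing z_in / z_out can never go out of bounds
in_deg = torch.clamp(in_deg, max=self.max_in_degree - 1)
out_deg = torch.clamp(out_deg, max=self.max_out_degree - 1)

x += self.z_in[in_deg] + self.z_out[out_deg]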

Sincere thanks for your reply. I just found that I had set a wrong hyperparameter in the Graphormer; your previous version was correct. I have 7 nodes in each graph, but I had set max_in_degree = 5, which caused the issue above.

        self.encoder = Graphormer(num_layers = self.layer_num,
                                  input_node_dim = 6,
                                  node_dim = 6,
                                  input_edge_dim = 6,
                                  edge_dim = 6,
                                  output_dim = 3584,
                                  n_heads = 6,
                                  max_in_degree = 7,
                                  max_out_degree = 7,
                                  max_path_distance = 5,
                                  )

At present I input 1000 graphs at once, each graph with 7 nodes. What I obtain from the Graphormer is a node embedding for every node (torch.Size([7000, 16])). Sometimes I care more about the encoding of each graph than about the individual node embeddings. Could you therefore add a readout function to your code (e.g. global_add_pool, global_mean_pool, global_max_pool), so that users can choose whether the Graphormer ultimately returns node-level embeddings or one embedding per graph?
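
For now I do the readout outside the model with PyG's pooling helpers. A minimal sketch, assuming the encoder is called as in my code above and returns one embedding row per node:

from torch_geometric.nn import global_mean_pool

node_emb = self.encoder(data)                       # e.g. torch.Size([7000, 16]), one row per node
graph_emb = global_mean_pool(node_emb, data.batch)  # e.g. torch.Size([1000, 16]), one row per graph

# global_add_pool / global_max_pool can be swapped in the same way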

leffff commented

Thanks! That's a great idea!
I'll implement that! Graphormer has its own aggregation ([VNode]) that I'll implement in the near future.

Btw, can you share the results? Did Graphormer do better than GIN?

Hi!
According to some recent papers, Graphormer outperforms GIN. Those authors validated it on the PCQM4M dataset, but I am not sure yet whether it will improve my project; further experiments are needed to verify that. For now I can provide some results from other people. Please see below.

(screenshot of results from https://arxiv.org/abs/2106.05234)

(screenshot of results from https://arxiv.org/abs/2103.09430)

(screenshot of results from https://arxiv.org/abs/2207.02505)

leffff commented

Thanks! Good luck with your research!