random_walk_cuda is causing an illegal memory access
ProfDoof opened this issue · 7 comments
Hi,
When running the following code, I get an illegal memory access error with the graph below. I am not sure why, and I do not understand the algorithm or the C++ code well enough to track it down. I do not get the error when I set device to 'cpu'.
I'm using the nightly build of PyG installed through a locally built conda package, and version 1.6.1 of torch-cluster.
from torch_geometric.data import Data
from torch_geometric.utils import to_networkx
from networkx.drawing.nx_agraph import write_dot
import torch

# Small directed graph with 7 nodes; node 3 has no outgoing edges and node 6
# is completely isolated (no incoming or outgoing edges).
new_node_ids = list(range(7))
sources = [0, 1, 2, 2, 4, 5]
targets = [1, 2, 3, 4, 5, 2]

data = Data(torch.tensor(new_node_ids), torch.tensor([sources, targets]))
data.num_nodes = 7
write_dot(to_networkx(data), 'test_test.dot')

device = 'cuda'
# Convert the edge index to CSR (rowptr, col) on the GPU.
rowptr, col, perm = data.to(device).csr()
print(rowptr, col)

# Start one walk of length 10 from every node, with p=2 and q=4.
start_indices = torch.arange(0, data.num_nodes, dtype=torch.long).to(device)
print(torch.ops.torch_cluster.random_walk(rowptr, col, start_indices,
                                          10, 2, 4))
EDIT:
Here's the output and the error I get:
tensor([0, 1, 2, 4, 4, 5, 6, 6], device='cuda:0') tensor([1, 2, 3, 4, 5, 2], device='cuda:0')
Traceback (most recent call last):
File "/home/john/Research/EmbeddingGraphs/cfg2vec/gnn/test.py", line 25, in <module>
print(torch.ops.torch_cluster.random_walk(rowptr, col, start_indices,
File "/home/john/mambaforge/envs/gnn/lib/python3.9/site-packages/torch/_ops.py", line 503, in __call__
return self._op(*args, **kwargs or {})
RuntimeError: CUDA error: an illegal memory access was encountered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
This seems to be failing because node 6 is an isolated node, so setting data.num_nodes = 6 should fix this.
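For reference, here is a rough sketch of what I suspect happens for the isolated node; this is only a simplified model of the neighbor lookup, not the actual CUDA kernel. In the CSR output above, the neighbor range of node 6 is [rowptr[6], rowptr[7]) = [6, 6), which is empty and starts exactly at the end of col, so an unguarded read of col[rowptr[6]] falls one element past the end of the array:

import torch

# CSR of the example graph, copied from the printed output above.
rowptr = torch.tensor([0, 1, 2, 4, 4, 5, 6, 6])
col = torch.tensor([1, 2, 3, 4, 5, 2])

v = 6                                       # the isolated node
lo, hi = int(rowptr[v]), int(rowptr[v + 1])
print(lo, hi, col.numel())                  # 6 6 6 -> empty range at the end of col
# col[lo] here would read past the end of col; on the GPU that surfaces as an
# illegal memory access instead of a Python IndexError.

Node 3 also has an empty neighbor range, but its range starts at 4, which is still inside col, which may explain why only node 6 triggers the crash.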
This is a minimal example; the actual graph is more complicated and I can't remove the isolated nodes. Also, this doesn't fail for any other values of p or q, and it only happens in the CUDA version, not the CPU version. All that being said, I'm not sure what exactly is going on.
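In the meantime, the only workaround I can think of on my side is to give every node with no outgoing edges a self-loop before building the CSR, so that no node ends up with an empty neighbor range. This is just an untested sketch on top of the data object from the example above, and it does change the walk semantics slightly for those nodes (walks starting there now step onto themselves):

import torch

# Find nodes with no outgoing edges (out-degree 0).
out_deg = torch.zeros(data.num_nodes, dtype=torch.long)
out_deg.scatter_add_(0, data.edge_index[0], torch.ones_like(data.edge_index[0]))
dangling = (out_deg == 0).nonzero(as_tuple=False).view(-1)

# Add a self-loop for each such node so a walk can always "stay put".
self_loops = dangling.repeat(2, 1)          # shape [2, num_dangling]
data.edge_index = torch.cat([data.edge_index, self_loops], dim=1)

I have not checked whether this is actually a supported way to handle isolated nodes, so let me know if that is a bad idea.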
@rusty1s just wanted to check if you had the chance to see this yet this evening.
Will take a look soon.
Wondering if there are any updates on this issue.
Not yet, sorry for the delay.
This issue had no activity for 6 months. It will be closed in 2 weeks unless there is some new activity. Is this issue already resolved?