pyg-team/pyg-lib

segment_matmul failing on CUDA

Closed this issue · 7 comments

๐Ÿ› Describe the bug

I was trying to reproduce the minimal example listed in the documentation for segment_matmul. However, I've found that while it works on the CPU, it fails on the GPU with a RuntimeError. Here's the code I've used:

# segmat.py
# Minimal reproduction of the segment_matmul example from the docs.
from argparse import ArgumentParser

import torch
from torch_geometric.typing import pyg_lib

parser = ArgumentParser()
parser.add_argument('device', type=str)
args = parser.parse_args()

device = torch.device(args.device)
inputs = torch.randn(8, 16, device=device)     # 8 rows of 16 features
ptr = torch.tensor([0, 5, 8], device=device)   # two segments: rows [0, 5) and [5, 8)
other = torch.randn(2, 16, 32, device=device)  # one 16x32 weight matrix per segment

out = pyg_lib.ops.segment_matmul(inputs, ptr, other)  # expected shape: [8, 32]

Running

python segmat.py cpu

works, but

python segmat.py cuda

throws the following:

Traceback (most recent call last):
  File "/home/daniel/Drive/VU/projects/2023-06-09-exigraph/exigraph/segmat.py", line 15, in <module>
    out = pyg_lib.ops.segment_matmul(inputs, ptr, other)
  File "/home/daniel/miniconda3/envs/exigraph/lib/python3.10/site-packages/pyg_lib/ops/__init__.py", line 95, in segment_matmul
    out = torch.ops.pyg.segment_matmul(inputs, ptr, other)
  File "/home/daniel/miniconda3/envs/exigraph/lib/python3.10/site-packages/torch/_ops.py", line 502, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: GroupedGEMM run failed
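For reference, segment_matmul performs one dense matmul per ptr-delimited segment. A plain-PyTorch equivalent for the shapes above (a sketch, handy for cross-checking the expected output while the CUDA kernel fails) is:

# Loop-based equivalent of segment_matmul (no grouped GEMM involved).
outs = []
for i in range(ptr.numel() - 1):
    seg = inputs[ptr[i]:ptr[i + 1]]  # rows belonging to segment i
    outs.append(seg @ other[i])      # apply segment i's weight matrix
out = torch.cat(outs, dim=0)         # shape: [8, 32]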

Environment

  • pyg-lib version: 0.2.0+pt20cu117
  • PyTorch version: 2.0.1
  • OS: Ubuntu 22.04.2 LTS
  • Python version: 3.10.11
  • CUDA/cuDNN version: 11.7
  • How you installed PyTorch and pyg-lib (conda, pip, source): pip, using
pip install pyg-lib -f https://data.pyg.org/whl/torch-2.0.0+cu117.html

Thanks @dfdazac. Which GPU are you using? @puririshi98, is this a known issue?

I'm using a GTX 1650 Mobile on my work laptop. However, after your question I tried updating the driver and also running the script on other machines; here's what I got:

GPU               Driver      Works?
GTX 1650 Mobile   470.199.02  ❌
GTX 1650 Mobile   535.86.05   ❌
GeForce GTX 1080  470.57.02   ❌
RTX A3000         528.89      ✔️
A100 40GB         520.61.05   ✔️

It looks like an issue with older cards.
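If the split really is by GPU generation (the failing cards above are Pascal/Turing parts, the working ones Ampere), a quick way to check is to print each device's compute capability:

# check_sm.py -- print the compute capability of every visible GPU,
# to see whether the failures line up with older architectures
import torch

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f'{torch.cuda.get_device_name(i)}: sm_{major}{minor}')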

I tested our latest NVIDIA container on

nvidia-smi -L
GPU 0: NVIDIA GeForce GTX 1080 Ti (UUID: GPU-89a48638-3d6b-10ae-38b1-01f03b50d9e8)

and cannot reproduce the error. I will see if older versions of CUDA/PyG etc. can reproduce it. I went as far back as the April 2023 NVIDIA container and still could not reproduce the error; I will continue looking further back when I find the time.

When I tried to build older versions, I hit some build issues. @dfdazac, is it possible for you to try our containers and see if they work for you? This is turning out to be very difficult to reproduce on my end:
https://developer.nvidia.com/pyg-container-early-access

This seems to be a problem with our nightly builds. Installing from source works for me. @dfdazac, just to confirm: you are using the provided nightly build?
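For reference, installing from source should amount to the pip-from-git route described in the pyg-lib README (a sketch; check the README for the exact steps):

pip install git+https://github.com/pyg-team/pyg-lib.git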

@puririshi98 Can you check whether the wheels work on your end?

Following up on this: I was not able to reproduce the error with any similar hardware. I will close the issue for now, suspecting it may have been a separate setup issue on @dfdazac's machine. Feel free to re-open if the issue persists with the latest wheels, source builds, or the NVIDIA container:
container: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pyg
(I recommend the container for the simplest and most up-to-date setup.)