pyg-team/pyg-lib

segment_matmul failing on CUDA

Closed this issue · 7 comments

๐Ÿ› Describe the bug

I was trying to reproduce the minimal example listed in the documentation for segment_matmul. However, I've found that while it works on the CPU, it fails on the GPU with a RuntimeError. Here's the code I've used:

# segmat.py
# Minimal reproduction of the segment_matmul example from the docs.
from argparse import ArgumentParser

import torch
from torch_geometric.typing import pyg_lib

parser = ArgumentParser()
parser.add_argument('device', type=str)
args = parser.parse_args()

device = torch.device(args.device)
inputs = torch.randn(8, 16, device=device)     # 8 rows of 16 features
ptr = torch.tensor([0, 5, 8], device=device)   # two segments: rows [0, 5) and [5, 8)
other = torch.randn(2, 16, 32, device=device)  # one 16x32 weight matrix per segment

out = pyg_lib.ops.segment_matmul(inputs, ptr, other)  # expected shape: [8, 32]

Running

python segmat.py cpu

works, but

python segmat.py cuda

throws the following:

Traceback (most recent call last):
  File "/home/daniel/Drive/VU/projects/2023-06-09-exigraph/exigraph/segmat.py", line 15, in <module>
    out = pyg_lib.ops.segment_matmul(inputs, ptr, other)
  File "/home/daniel/miniconda3/envs/exigraph/lib/python3.10/site-packages/pyg_lib/ops/__init__.py", line 95, in segment_matmul
    out = torch.ops.pyg.segment_matmul(inputs, ptr, other)
  File "/home/daniel/miniconda3/envs/exigraph/lib/python3.10/site-packages/torch/_ops.py", line 502, in __call__
    return self._op(*args, **kwargs or {})
RuntimeError: GroupedGEMM run failed
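For reference, segment_matmul performs one dense matmul per ptr-delimited segment. A plain-PyTorch equivalent for the shapes above (a sketch, handy for cross-checking the expected output while the CUDA kernel fails) is:

# Loop-based equivalent of segment_matmul (no grouped GEMM involved).
outs = []
for i in range(ptr.numel() - 1):
    seg = inputs[ptr[i]:ptr[i + 1]]  # rows belonging to segment i
    outs.append(seg @ other[i])      # apply segment i's weight matrix
out = torch.cat(outs, dim=0)         # shape: [8, 32]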

Environment

  • pyg-lib version: 0.2.0+pt20cu117
  • PyTorch version: 2.0.1
  • OS: Ubuntu 22.04.2 LTS
  • Python version: 3.10.11
  • CUDA/cuDNN version: 11.7
  • How you installed PyTorch and pyg-lib (conda, pip, source): pip, using
pip install pyg-lib -f https://data.pyg.org/whl/torch-2.0.0+cu117.html

Thanks @dfdazac. Which GPU are you using? @puririshi98, is this a known issue?

I'm using a GTX 1650 Mobile on my work laptop. However, after your question I tried updating the driver and also running the script on other machines; here's what I got:

GPU               Driver      Works?
GTX 1650 Mobile   470.199.02  ❌
GTX 1650 Mobile   535.86.05   ❌
GeForce GTX 1080  470.57.02   ❌
RTX A3000         528.89      ✔️
A100 40GB         520.61.05   ✔️

It looks like an issue with older cards.
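If the split really is by GPU generation (the failing cards above are Pascal/Turing parts, the working ones Ampere), a quick way to check is to print each device's compute capability:

# check_sm.py -- print the compute capability of every visible GPU,
# to see whether the failures line up with older architectures
import torch

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f'{torch.cuda.get_device_name(i)}: sm_{major}{minor}')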

I tested our latest NVIDIA container on

nvidia-smi -L
GPU 0: NVIDIA GeForce GTX 1080 Ti (UUID: GPU-89a48638-3d6b-10ae-38b1-01f03b50d9e8)

and cannot reproduce the error. I will see if older versions of CUDA/PyG etc. can reproduce it. I went as far back as the April 2023 NVIDIA container and still could not reproduce the error; I will continue looking further back when I find the time.

When I tried to build older versions, I hit some build issues. @dfdazac, is it possible for you to try our containers and see if they work for you? This is turning out to be very difficult to reproduce on my end:
https://developer.nvidia.com/pyg-container-early-access

This seems to be a problem with our nightly builds. Installing from source works for me. @dfdazac, just to confirm: you are using the provided nightly build?
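For reference, installing from source should amount to the pip-from-git route described in the pyg-lib README (a sketch; check the README for the exact steps):

pip install git+https://github.com/pyg-team/pyg-lib.git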

@puririshi98 Can you check whether the wheels work on your end?

Following up on this: I was not able to reproduce the error with any similar hardware. I will close the issue for now, suspecting it may have been a separate setup issue on @dfdazac's machine. Feel free to re-open if the issue persists with the latest wheels, source builds, or the NVIDIA container:
container: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pyg
(I recommend the container for the simplest and most up-to-date setup.)