segment_matmul failing on CUDA
Closed this issue · 7 comments
🐛 Describe the bug
I was trying to reproduce the minimal example listed in the documentation for segment_matmul. However, I've found that while it works on the CPU, it fails on the GPU with a RuntimeError. Here's the code I've used:
# segmat.py
from argparse import ArgumentParser
from torch_geometric.typing import pyg_lib
import torch

parser = ArgumentParser()
parser.add_argument('device', type=str)
args = parser.parse_args()
device = torch.device(args.device)

inputs = torch.randn(8, 16, device=device)     # 8 input rows, 16 features each
ptr = torch.tensor([0, 5, 8], device=device)   # two segments: rows [0, 5) and [5, 8)
other = torch.randn(2, 16, 32, device=device)  # one 16x32 weight matrix per segment
out = pyg_lib.ops.segment_matmul(inputs, ptr, other)
Running python segmat.py cpu works, but python segmat.py cuda throws the following:
Traceback (most recent call last):
File "/home/daniel/Drive/VU/projects/2023-06-09-exigraph/exigraph/segmat.py", line 15, in <module>
out = pyg_lib.ops.segment_matmul(inputs, ptr, other)
File "/home/daniel/miniconda3/envs/exigraph/lib/python3.10/site-packages/pyg_lib/ops/__init__.py", line 95, in segment_matmul
out = torch.ops.pyg.segment_matmul(inputs, ptr, other)
File "/home/daniel/miniconda3/envs/exigraph/lib/python3.10/site-packages/torch/_ops.py", line 502, in __call__
return self._op(*args, **kwargs or {})
RuntimeError: GroupedGEMM run failed
Environment
- pyg-lib version: 0.2.0+pt20cu117
- PyTorch version: 2.0.1
- OS: Ubuntu 22.04.2 LTS
- Python version: 3.10.11
- CUDA/cuDNN version: 11.7
- How you installed PyTorch and pyg-lib (conda, pip, source): pip, using pip install pyg-lib -f https://data.pyg.org/whl/torch-2.0.0+cu117.html
Thanks @dfdazac. Which GPU are you using? @puririshi98 is this a known issue?
I'm using a GTX 1650 Mobile, on my work laptop. However, after your question I tried updating the driver, then also running it on other machines, and here's what I got:
GPU | Driver | Works?
---|---|---
GTX 1650 Mobile | 470.199.02 | ❌
GTX 1650 Mobile | 535.86.05 | ❌
GeForce GTX 1080 | 470.57.02 | ❌
RTX A3000 | 528.89 | ✔️
A100 40GB | 520.61.05 | ✔️
It looks like an issue with older cards.
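For what it's worth, the cards that fail above are all pre-Ampere (the GTX 1650 is compute capability 7.5, the GTX 1080 is 6.1), while the RTX A3000 and A100 are Ampere-class (8.6 and 8.0). That cutoff is only a guess from the table, not a confirmed requirement of the GroupedGEMM kernel. A small sketch for checking what a given card reports:

import torch

# Report the visible GPU and its compute capability; compare against the table above.
# (Only a diagnostic; the capability >= 8.0 cutoff is an assumption, not confirmed.)
if torch.cuda.is_available():
    name = torch.cuda.get_device_name(0)
    major, minor = torch.cuda.get_device_capability(0)
    print(f"{name}: compute capability {major}.{minor}")
else:
    print("No CUDA device visible")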
I tested our latest NVIDIA container on a GTX 1080 Ti (nvidia-smi -L reports GPU 0: NVIDIA GeForce GTX 1080 Ti (UUID: GPU-89a48638-3d6b-10ae-38b1-01f03b50d9e8)) and cannot reproduce the error. I will see if older versions of CUDA/PyG etc. can repro it. I went as far back as the NVIDIA April 2023 container and still could not reproduce the error; I will continue looking further back when I find the time.
When I tried to build older versions I hit some build issues. @dfdazac, is it possible for you to try our containers and see if they work for you? This is turning out to be very difficult to reproduce on my end:
https://developer.nvidia.com/pyg-container-early-access
This seems to be a problem with our nightly builds. Installing from source works for me. @dfdazac Just to confirm: You are using the provided nightly build?
@puririshi98 Can you check whether the wheels work on your end?
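A quick way to confirm which build is actually installed (a small diagnostic sketch; pyg_lib.__version__ and torch.version.cuda are assumed to be present in your install):

import torch
import pyg_lib

# Print the installed builds so a nightly wheel can be told apart from a source build.
print("pyg-lib:", pyg_lib.__version__)           # e.g. 0.2.0+pt20cu117 for the CUDA 11.7 wheel
print("PyTorch:", torch.__version__)
print("CUDA (torch build):", torch.version.cuda)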
Following up on this, I was not able to reproduce the error with any similar hardware. I will close the issue for now, suspecting it may have been a separate setup issue on @dfdazac's computer. Feel free to re-open if the issue persists with the latest wheels, source builds, or the NVIDIA container.
container: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pyg
(I recommend the container for the simplest and latest setup.)