Optimizer compilation fails with PyTorch 2.2
rosario-purple opened this issue · 2 comments
rosario-purple commented
What's the issue, what's expected?:
I tried to compile the MS-AMP optimizer with the new Torch 2.2:
cd msamp/optim
pip install -v .
but got this error:
File "/scratch/brr/MS-AMP/msamp/optim/setup.py", line 7, in <module>
from torch.utils import cpp_extension
File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/__init__.py", line 237, in <module>
from torch._C import * # noqa: F403
ImportError: /scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so: undefined symbol: ncclCommRegister
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> See above for output.
How to reproduce it?:
Running this code in Python reproduces the error:
>>> import torch
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/__init__.py", line 237, in <module>
from torch._C import * # noqa: F403
ImportError: /scratch/miniconda3/envs/brr/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so: undefined symbol: ncclCommRegister
Log message or snapshot?:
See above
Additional information:
My best guess is that this is caused by MS-AMP being pinned to an older external version of libnccl (2.17.1), while PyTorch 2.2 seems to depend on a newer one (2.19.3).
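One way to check this guess without importing torch (which fails here) is to ask the NCCL shared library for its version directly. The sketch below is only an illustration: it assumes the library the dynamic loader resolves is named `libnccl.so.2`, and the helper names `decode_nccl_version`, `loaded_nccl_version` and `exports_symbol` are made up for this example. `ncclGetVersion` is part of the public NCCL C API; the integer it returns encodes the version as major*10000 + minor*100 + patch (major*1000 + minor*100 + patch for releases up to 2.8).

```python
import ctypes


def decode_nccl_version(code):
    """Turn an NCCL version code (e.g. 21701) into a (major, minor, patch) tuple."""
    # Releases up to 2.8.x used major*1000; later ones use major*10000.
    base = 1000 if code < 10000 else 10000
    major, rest = divmod(code, base)
    minor, patch = divmod(rest, 100)
    return (major, minor, patch)


def loaded_nccl_version(libname="libnccl.so.2"):
    """Return the version of the NCCL library the loader finds, or None.

    `libname` is an assumption; adjust it to whatever your environment ships.
    """
    try:
        lib = ctypes.CDLL(libname)
        get_version = lib.ncclGetVersion
    except (OSError, AttributeError):
        # Library not found, or it does not export ncclGetVersion.
        return None
    code = ctypes.c_int()
    status = get_version(ctypes.byref(code))
    if status != 0:  # 0 is ncclSuccess
        return None
    return decode_nccl_version(code.value)


def exports_symbol(libname, symbol):
    """Return True if the shared library can be loaded and exports `symbol`."""
    try:
        lib = ctypes.CDLL(libname)
    except OSError:
        return False
    return hasattr(lib, symbol)


if __name__ == "__main__":
    print("NCCL found by the loader:", loaded_nccl_version())
    print("exports ncclCommRegister:",
          exports_symbol("libnccl.so.2", "ncclCommRegister"))
```

If this reports 2.17.1 and no `ncclCommRegister`, that would be consistent with the older libnccl being picked up ahead of the one PyTorch 2.2 expects, though it would not prove where that library comes from.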
tocean commented
We haven't tested MS-AMP with PyTorch 2.2. Currently we only support PyTorch 1.14 and 2.1. We recommend using our Docker image or nvcr.io/nvidia/pytorch:23.10-py3. We also plan to upgrade MSCCL to the latest version.
tocean commented
Can you share the complete steps for reproducing this issue?