mit-han-lab/distrifuser

I have a question about running the code. Running the command torchrun --nproc_per_node=2 scripts/sdxl_example.py fails with an error. My torch version is 2.2.1, CUDA version is 11.8, and Python version is 3.10.

CharvinMei opened this issue · 4 comments

[rank1]:[W CUDAGraph.cpp:145] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
[rank0]:[W CUDAGraph.cpp:145] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL Error 5: invalid usage (run with NCCL_DEBUG=WARN for details)
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL Error 5: invalid usage (run with NCCL_DEBUG=WARN for details)
[2024-06-12 03:45:51,447] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 10453) of binary: /root/anaconda3/envs/distrifuser/bin/python
Traceback (most recent call last):
File "/root/anaconda3/envs/distrifuser/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/root/anaconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/root/anaconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/root/anaconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/root/anaconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/anaconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

scripts/sdxl_example.py FAILED

Failures:
[1]:
time : 2024-06-12_03:45:51
host : 692d3f5c0349
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 10454)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 10454

Root Cause (first observed failure):
[0]:
time : 2024-06-12_03:45:51
host : 692d3f5c0349
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 10453)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 10453
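The NCCL message in the log above asks for a re-run with NCCL_DEBUG=WARN to get more detail. A minimal sketch of a launcher that does this, assuming torchrun is on the PATH and the command is the one quoted in the issue title. The helper names `build_debug_launch` and `run_debug_launch` are hypothetical and not part of distrifuser or PyTorch.

```python
import os
import subprocess


def build_debug_launch(nproc=2, script="scripts/sdxl_example.py", base_env=None):
    """Return (command, environment) for a torchrun launch with NCCL warnings enabled."""
    # Copy the environment so the caller's process environment is not modified.
    env = dict(os.environ if base_env is None else base_env)
    # NCCL reads this variable at start-up; WARN prints details for errors like "invalid usage".
    env["NCCL_DEBUG"] = "WARN"
    cmd = ["torchrun", f"--nproc_per_node={nproc}", script]
    return cmd, env


def run_debug_launch(nproc=2):
    """Launch the example and return torchrun's exit status."""
    cmd, env = build_debug_launch(nproc)
    return subprocess.run(cmd, env=env).returncode
```

Calling `run_debug_launch(nproc=1)` would give the single-GPU comparison run, while `run_debug_launch(nproc=2)` repeats the failing two-GPU launch with the extra NCCL output.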

This looks like a torchrun or NCCL issue. Are you able to run it with a single GPU?

Yes, I can run it with a single GPU. But when it’s set up for two GPUs, an error occurs.

Weird. Could you try disabling CUDA Graph to see if it works? You can simply pass use_cuda_graph=False here.
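As a rough illustration of that suggestion, the sketch below passes use_cuda_graph=False when building the config. It assumes the `DistriConfig` and `DistriSDXLPipeline` usage shown in the repository README. The height, width, warmup_steps values and the model name come from that example and are assumptions here, not something confirmed in this thread. `CONFIG_KWARGS` and `build_pipeline` are hypothetical names.

```python
# Keyword arguments for DistriConfig; use_cuda_graph=False is the change suggested above.
# The other values mirror the README example and are assumptions.
CONFIG_KWARGS = dict(height=1024, width=1024, warmup_steps=4, use_cuda_graph=False)


def build_pipeline():
    """Build the SDXL pipeline with CUDA Graph disabled (run under torchrun)."""
    # Imported lazily so this module can be loaded without distrifuser installed.
    from distrifuser.pipelines import DistriSDXLPipeline
    from distrifuser.utils import DistriConfig

    distri_config = DistriConfig(**CONFIG_KWARGS)
    pipeline = DistriSDXLPipeline.from_pretrained(
        distri_config=distri_config,
        pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0",
        variant="fp16",
        use_safetensors=True,
    )
    return distri_config, pipeline
```

If the "Waiting for pending NCCL work to finish before starting graph capture" warning still appears after this change, it may be worth checking that the edited script is the one torchrun actually launches.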

After changing the setting, the error changed to the following:

[rank0]:[W CUDAGraph.cpp:145] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
[rank1]:[W CUDAGraph.cpp:145] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
[2024-07-08 06:10:31,482] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 0 (pid: 99442) of binary: /home/meichangwang/miniconda3/envs/distrifuser/bin/python
Traceback (most recent call last):
File "/home/meichangwang/miniconda3/envs/distrifuser/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/home/meichangwang/miniconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/home/meichangwang/miniconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/home/meichangwang/miniconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/home/meichangwang/miniconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/meichangwang/miniconda3/envs/distrifuser/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

scripts/sdxl_example.py FAILED

Failures:
[1]:
time : 2024-07-08_06:10:31
host : ubuntu
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 99443)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 99443

Root Cause (first observed failure):
[0]:
time : 2024-07-08_06:10:31
host : ubuntu
rank : 0 (local_rank: 0)
exitcode : -6 (pid: 99442)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 99442
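The summary above reports exitcode -6 together with "Signal 6 (SIGABRT)". A negative exit code means the worker was terminated by the signal with that number, so the abort comes from the worker process itself and not from torchrun. A small sketch of that mapping, using only the standard library; `describe_exitcode` is a hypothetical helper name.

```python
import signal


def describe_exitcode(exitcode):
    """Translate a worker exit code from the torchrun summary into a readable description."""
    if exitcode < 0:
        # A negative code means the process was terminated by signal number -exitcode.
        return f"killed by {signal.Signals(-exitcode).name}"
    return f"exited with status {exitcode}"
```

On Linux, `describe_exitcode(-6)` returns "killed by SIGABRT", which matches the traceback lines in both logs. The summary alone does not say why the process aborted, so the lines printed before it are where the actual cause would appear.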