potential reoccurrence of https://github.com/aws/aws-ofi-nccl/issues/69
junpuf opened this issue · 1 comment
Hi there,
While benchmarking distributed training with Metaseq OPT on multiple p4d.24xlarge instances, we ran into an issue where the training processes launched by Slurm via the "opt-baselines" launcher fail with "OSError: [Errno 12] Cannot allocate memory" in the PyTorch DataLoader.
Traceback (most recent call last):
File "./slurm_snapshot_code_oss/2023-02-27T06_32_55.055719/metaseq/cli/train.py", line 793, in <module>
cli_main()
File "./slurm_snapshot_code_oss/2023-02-27T06_32_55.055719/metaseq/cli/train.py", line 789, in cli_main
distributed_utils.call_main(cfg, main)
File "/fsx/aws-conda-benchmark-cluster-controller/jobs/4200_opt_4node_pt-1.13.1-cu117-py3.9_aws/aws-conda-benchmarks/autobench/opt/slurm_snapshot_code_oss/2023-02-27T06_32_55.055719/metaseq/distributed/utils.py", line 289, in call_main
return distributed_main(
File "/fsx/aws-conda-benchmark-cluster-controller/jobs/4200_opt_4node_pt-1.13.1-cu117-py3.9_aws/aws-conda-benchmarks/autobench/opt/slurm_snapshot_code_oss/2023-02-27T06_32_55.055719/metaseq/distributed/utils.py", line 227, in distributed_main
retval = main(cfg, **kwargs)
File "./slurm_snapshot_code_oss/2023-02-27T06_32_55.055719/metaseq/cli/train.py", line 190, in main
valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
File "/fsx/conda/envs/autobench-opt-benchmark-aws-pytorch-1.13.1-cuda-11.7-python-3.8/lib/python3.8/contextlib.py", line 75, in inner
return func(*args, **kwds)
File "./slurm_snapshot_code_oss/2023-02-27T06_32_55.055719/metaseq/cli/train.py", line 339, in train
samples = next(progress_iter)
File "/fsx/aws-conda-benchmark-cluster-controller/jobs/4200_opt_4node_pt-1.13.1-cu117-py3.9_aws/aws-conda-benchmarks/autobench/opt/slurm_snapshot_code_oss/2023-02-27T06_32_55.055719/metaseq/logging/progress_bar/json_progress_bar.py", line 38, in __iter__
for i, obj in enumerate(self.iterable, start=self.n):
File "/fsx/aws-conda-benchmark-cluster-controller/jobs/4200_opt_4node_pt-1.13.1-cu117-py3.9_aws/aws-conda-benchmarks/autobench/opt/slurm_snapshot_code_oss/2023-02-27T06_32_55.055719/metaseq/data/iterators.py", line 62, in __iter__
for x in self.iterable:
File "/fsx/aws-conda-benchmark-cluster-controller/jobs/4200_opt_4node_pt-1.13.1-cu117-py3.9_aws/aws-conda-benchmarks/autobench/opt/slurm_snapshot_code_oss/2023-02-27T06_32_55.055719/metaseq/data/iterators.py", line 851, in __next__
raise item
File "/fsx/aws-conda-benchmark-cluster-controller/jobs/4200_opt_4node_pt-1.13.1-cu117-py3.9_aws/aws-conda-benchmarks/autobench/opt/slurm_snapshot_code_oss/2023-02-27T06_32_55.055719/metaseq/data/iterators.py", line 782, in run
for item in self._source:
File "/fsx/conda/envs/autobench-opt-benchmark-aws-pytorch-1.13.1-cuda-11.7-python-3.8/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in __iter__
return self._get_iterator()
File "/fsx/conda/envs/autobench-opt-benchmark-aws-pytorch-1.13.1-cuda-11.7-python-3.8/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 381, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
File "/fsx/conda/envs/autobench-opt-benchmark-aws-pytorch-1.13.1-cuda-11.7-python-3.8/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1034, in __init__
w.start()
File "/fsx/conda/envs/autobench-opt-benchmark-aws-pytorch-1.13.1-cuda-11.7-python-3.8/lib/python3.8/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/fsx/conda/envs/autobench-opt-benchmark-aws-pytorch-1.13.1-cuda-11.7-python-3.8/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/fsx/conda/envs/autobench-opt-benchmark-aws-pytorch-1.13.1-cuda-11.7-python-3.8/lib/python3.8/multiprocessing/context.py", line 277, in _Popen
return Popen(process_obj)
File "/fsx/conda/envs/autobench-opt-benchmark-aws-pytorch-1.13.1-cuda-11.7-python-3.8/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/fsx/conda/envs/autobench-opt-benchmark-aws-pytorch-1.13.1-cuda-11.7-python-3.8/lib/python3.8/multiprocessing/popen_fork.py", line 70, in _launch
self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory
After debugging, we found two ways to avoid the above error:
1. Unset FI_EFA_USE_DEVICE_RDMA before launching training.
2. Adjust --num-worker from the default of 8 down to 0, 1, or 2 (see the sketch after this list).
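For reference, a minimal sketch of the second workaround using a plain PyTorch DataLoader rather than the actual metaseq data pipeline (the dataset, batch size, and worker count below are placeholders chosen for illustration):

```python
# Sketch of workaround 2 (not the metaseq code path): lower the DataLoader
# worker count. Each worker is started via os.fork(), so fewer workers means
# fewer child address spaces for the kernel to account for when host memory
# is already tight.
import torch
from torch.utils.data import DataLoader, TensorDataset

def main():
    # Placeholder dataset; the real job streams tokenized text through metaseq.
    dataset = TensorDataset(torch.randn(1024, 16))

    # Our benchmark config defaulted to 8 workers; dropping to 0-2 avoided the
    # "[Errno 12] Cannot allocate memory" raised from os.fork().
    loader = DataLoader(dataset, batch_size=32, num_workers=2, pin_memory=True)

    for (batch,) in loader:
        pass  # training step would go here

if __name__ == "__main__":
    main()
```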
The fact that these workarounds help makes us believe this might be the same issue as #69.
System Info:
PyTorch: 1.13.1
NVIDIA Driver: 525.85.12
CUDA: 11.7
NCCL: 2.16.2 inc_nsteps
EFA Installer: 1.21.0
AWS OFI NCCL: 1.5.0-aws
As discussed internally, the two issues aren't related; this is mostly the GPU running out of memory. That is why removing the RDMA flag helps: without it, NCCL uses host buffers for network transfers rather than CUDA memory.
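For completeness, a hedged sketch of the first workaround, stripping FI_EFA_USE_DEVICE_RDMA from the environment before launching the job (the launch command below is a placeholder, not the real opt-baselines/srun invocation):

```python
# Sketch of workaround 1: launch training with FI_EFA_USE_DEVICE_RDMA removed
# from the environment, so the EFA path stops registering CUDA memory for
# network transfers and NCCL stages data through host buffers instead.
import os
import subprocess

env = os.environ.copy()
env.pop("FI_EFA_USE_DEVICE_RDMA", None)  # same effect as `unset` in the job script

# Placeholder command; substitute the real opt-baselines / srun invocation.
subprocess.run(
    ["python", "-c", "import os; print('FI_EFA_USE_DEVICE_RDMA' in os.environ)"],
    env=env,
    check=True,
)
```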