aws/aws-ofi-nccl

Potential recurrence of https://github.com/aws/aws-ofi-nccl/issues/69

junpuf opened this issue · 1 comment

Hi there,

While benchmarking distributed training of Metaseq OPT on multiple p4d.24xlarge instances, we hit an issue where the training processes launched by Slurm through the "opt-baselines" launcher fail with "OSError: [Errno 12] Cannot allocate memory" from the PyTorch DataLoader.

Traceback (most recent call last):
  File "./slurm_snapshot_code_oss/2023-02-27T06_32_55.055719/metaseq/cli/train.py", line 793, in <module>
    cli_main()
  File "./slurm_snapshot_code_oss/2023-02-27T06_32_55.055719/metaseq/cli/train.py", line 789, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/fsx/aws-conda-benchmark-cluster-controller/jobs/4200_opt_4node_pt-1.13.1-cu117-py3.9_aws/aws-conda-benchmarks/autobench/opt/slurm_snapshot_code_oss/2023-02-27T06_32_55.055719/metaseq/distributed/utils.py", line 289, in call_main
    return distributed_main(
  File "/fsx/aws-conda-benchmark-cluster-controller/jobs/4200_opt_4node_pt-1.13.1-cu117-py3.9_aws/aws-conda-benchmarks/autobench/opt/slurm_snapshot_code_oss/2023-02-27T06_32_55.055719/metaseq/distributed/utils.py", line 227, in distributed_main
    retval = main(cfg, **kwargs)
  File "./slurm_snapshot_code_oss/2023-02-27T06_32_55.055719/metaseq/cli/train.py", line 190, in main
    valid_losses, should_stop = train(cfg, trainer, task, epoch_itr)
  File "/fsx/conda/envs/autobench-opt-benchmark-aws-pytorch-1.13.1-cuda-11.7-python-3.8/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "./slurm_snapshot_code_oss/2023-02-27T06_32_55.055719/metaseq/cli/train.py", line 339, in train
    samples = next(progress_iter)
  File "/fsx/aws-conda-benchmark-cluster-controller/jobs/4200_opt_4node_pt-1.13.1-cu117-py3.9_aws/aws-conda-benchmarks/autobench/opt/slurm_snapshot_code_oss/2023-02-27T06_32_55.055719/metaseq/logging/progress_bar/json_progress_bar.py", line 38, in __iter__
    for i, obj in enumerate(self.iterable, start=self.n):
  File "/fsx/aws-conda-benchmark-cluster-controller/jobs/4200_opt_4node_pt-1.13.1-cu117-py3.9_aws/aws-conda-benchmarks/autobench/opt/slurm_snapshot_code_oss/2023-02-27T06_32_55.055719/metaseq/data/iterators.py", line 62, in __iter__
    for x in self.iterable:
  File "/fsx/aws-conda-benchmark-cluster-controller/jobs/4200_opt_4node_pt-1.13.1-cu117-py3.9_aws/aws-conda-benchmarks/autobench/opt/slurm_snapshot_code_oss/2023-02-27T06_32_55.055719/metaseq/data/iterators.py", line 851, in __next__
    raise item
  File "/fsx/aws-conda-benchmark-cluster-controller/jobs/4200_opt_4node_pt-1.13.1-cu117-py3.9_aws/aws-conda-benchmarks/autobench/opt/slurm_snapshot_code_oss/2023-02-27T06_32_55.055719/metaseq/data/iterators.py", line 782, in run
    for item in self._source:
  File "/fsx/conda/envs/autobench-opt-benchmark-aws-pytorch-1.13.1-cuda-11.7-python-3.8/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in __iter__
    return self._get_iterator()
  File "/fsx/conda/envs/autobench-opt-benchmark-aws-pytorch-1.13.1-cuda-11.7-python-3.8/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 381, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "/fsx/conda/envs/autobench-opt-benchmark-aws-pytorch-1.13.1-cuda-11.7-python-3.8/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1034, in __init__
    w.start()
  File "/fsx/conda/envs/autobench-opt-benchmark-aws-pytorch-1.13.1-cuda-11.7-python-3.8/lib/python3.8/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/fsx/conda/envs/autobench-opt-benchmark-aws-pytorch-1.13.1-cuda-11.7-python-3.8/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/fsx/conda/envs/autobench-opt-benchmark-aws-pytorch-1.13.1-cuda-11.7-python-3.8/lib/python3.8/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/fsx/conda/envs/autobench-opt-benchmark-aws-pytorch-1.13.1-cuda-11.7-python-3.8/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/fsx/conda/envs/autobench-opt-benchmark-aws-pytorch-1.13.1-cuda-11.7-python-3.8/lib/python3.8/multiprocessing/popen_fork.py", line 70, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

After debugging, we found two ways to avoid the above error (a short sketch of both follows the list).
1: unset FI_EFA_USE_DEVICE_RDMA before launching training
2: reduce --num-worker from the default of 8 to 0, 1, or 2
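For reference, here is a minimal sketch of where each workaround takes effect inside the training process; the TensorDataset and the specific DataLoader arguments are illustrative placeholders, not the actual opt-baselines / metaseq code:

    import os
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Workaround 1: make sure FI_EFA_USE_DEVICE_RDMA is not set before NCCL /
    # libfabric initialize (equivalent to `unset FI_EFA_USE_DEVICE_RDMA` in the
    # job script before launching training).
    os.environ.pop("FI_EFA_USE_DEVICE_RDMA", None)

    # Workaround 2: reduce the number of DataLoader worker processes so fewer
    # os.fork() calls are made from the (already large) training process.
    # TensorDataset stands in for the real metaseq dataset here.
    dataset = TensorDataset(torch.randn(1024, 16), torch.randint(0, 2, (1024,)))
    loader = DataLoader(
        dataset,
        batch_size=8,
        num_workers=2,  # down from the default of 8
        pin_memory=True,
    )

    for batch in loader:
        pass  # training step would go here

In the original setup, workaround 1 corresponds to unsetting the variable in the Slurm job script before launch, and workaround 2 to passing a smaller --num-worker value to the launcher; the snippet only shows the equivalent changes at the process level.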

These workarounds lead us to believe this might be the same issue as #69.

System Info:

PyTorch: 1.13.1
NVIDIA Driver: 525.85.12
CUDA: 11.7
NCCL: 2.16.2 inc_nsteps
EFA Installer: 1.21.0
AWS OFI NCCL: 1.5.0-aws

As discussed internally, the two issues aren't related; this is most likely the GPU running out of memory. Unsetting the FI_EFA_USE_DEVICE_RDMA flag helps because, without device RDMA, NCCL uses host buffers for network transfers rather than CUDA memory.
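As a hedged illustration of how to check this memory-pressure explanation, a small diagnostic along these lines could log GPU headroom and the parent process's peak RSS right before the DataLoader workers fork; log_memory_headroom is a hypothetical helper and not part of metaseq or aws-ofi-nccl:

    import resource
    import torch

    def log_memory_headroom(tag: str) -> None:
        """Log free/total GPU memory and this process's peak RSS."""
        if torch.cuda.is_available():
            free_b, total_b = torch.cuda.mem_get_info()  # bytes, current device
            print(f"[{tag}] GPU free: {free_b / 2**30:.2f} GiB "
                  f"of {total_b / 2**30:.2f} GiB")
        # ru_maxrss is reported in kilobytes on Linux.
        rss_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
        print(f"[{tag}] host peak RSS: {rss_kb / 2**20:.2f} GiB")

    log_memory_headroom("before DataLoader worker fork")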