All-reduce error when training a model with multiple GPU nodes.
ZhuJiaqi9905 opened this issue · 4 comments
I use 4 nodes (1 GPU per node) to train the GPT model, and the chosen pipeline template is: best execution plan: 2 x <oobleck.PipelineTemplate.2nodes> pipelines (b: 6). However, it sometimes blocks at the first all-reduce and throws an error later. (About half of the time it throws the error; the rest of the time it trains normally.)
This is my training log:
[2024-04-06 08:22:23,544] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
mb size: 62, ar across: 62, ar in: 62
Loading done. creating layer execution results...
Returning from get_profiler_results
[2024-04-06 08:22:26,733] [INFO] [agent.py:154:_launch_workers] Job arguments: OobleckArguments(dist=DistributedArguments(master_ip='172.21.0.42', master_port=60000, node_ips=['172.21.0.42', '172.21.0.46', '172.21.0.90', '172.21.0.91'], node_port=2222, num_workers=1, num_agents_per_node=1, username='root'), job=JobArguments(fault_threshold=1, microbatch_size=8, global_microbatch_size=96, steps=100), model=ModelArguments(model_name='gpt2', model_tag='medium', dataset_path='wikitext', dataset_name='wikitext-2-raw-v1', model_args={'n_head': 48, 'num_hidden_layers': 60}))
[2024-04-06 08:22:26,734] [INFO] [agent.py:184:_launch_workers] in agent. my_ip 172.21.0.91, node_ips ['172.21.0.42', '172.21.0.46', '172.21.0.90', '172.21.0.91']
[2024-04-06 08:22:26,734] [INFO] [agent.py:186:_launch_workers] Launching worker 0...
[2024-04-06 08:22:28,678] [INFO] [real_accelerator.py:158:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-04-06 08:22:31,772] [INFO] [worker.py:23:worker_main] Initializing Oobleck Engine...
[2024-04-06 08:22:31,772] [INFO] [worker.py:24:worker_main] in worker main: my_ip 172.21.0.91
/root/miniconda3/envs/oobleck/lib/python3.10/site-packages/transformers/training_args.py:1281: FutureWarning: using `no_cuda` is deprecated and will be removed in version 5.0 of 🤗 Transformers. Use `use_cpu` instead
warnings.warn(
[2024-04-06 08:22:42,588] [INFO] [engine.py:502:_initialize_engine] model arguments: {'n_head': 48, 'num_hidden_layers': 60}
mb size: 62, ar across: 62, ar in: 62
Loading done. creating layer execution results...
Returning from get_profiler_results
[2024-04-06 08:22:45,365] [INFO] [engine.py:542:_initialize_engine] Number of nodes range: (2, 4)
Creating tasks for 2 nodes
Creating tasks for 3 nodes
Creating tasks for 4 nodes
Waiting for tasks for 2 nodes
Wait done
Cache hit: 647, miss: 8178
StageExecutionResult[0:30] with 1 devices
StageExecutionResult[31:61] with 1 devices
Waiting for tasks for 3 nodes
Wait done
Cache hit: 2492, miss: 114694
StageExecutionResult[0:21] with 1 devices
StageExecutionResult[22:42] with 1 devices
StageExecutionResult[43:61] with 1 devices
Waiting for tasks for 4 nodes
[2024-04-06 08:22:46,262] [DEBUG] [agent.py:289:on_receive_response] Receiving: (<Response.FORWARD_RANK0_PORT: 5>, <RequestType.UNDEFINED: 0>)
[2024-04-06 08:22:46,262] [DEBUG] [agent.py:236:on_receive_worker_port] agent recv TCP Store port 36047 from master
Wait done
Cache hit: 3193, miss: 222711
StageExecutionResult[0:15] with 1 devices
StageExecutionResult[16:31] with 1 devices
StageExecutionResult[32:46] with 1 devices
StageExecutionResult[47:61] with 1 devices
[2024-04-06 08:22:46,807] [INFO] [worker.py:26:worker_main] Initializing torch.distributed...
[2024-04-06 08:22:46,807] [DEBUG] [engine.py:580:initialize_distributed] regenerate rank_map {'172.21.0.42': [0], '172.21.0.46': [1], '172.21.0.90': [2], '172.21.0.91': [3]}
[2024-04-06 08:22:46,807] [INFO] [engine.py:591:initialize_distributed] init pg: rank 3, world_size: 4, rank_map: {'172.21.0.42': [0], '172.21.0.46': [1], '172.21.0.90': [2], '172.21.0.91': [3]}
[2024-04-06 08:22:46,807] [INFO] [engine.py:605:initialize_distributed] Waiting for a port information...
[2024-04-06 08:22:46,807] [INFO] [engine.py:608:initialize_distributed] Received torch master: 172.21.0.42.36047
[2024-04-06 08:22:47,959] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-04-06 08:22:47,959] [INFO] [engine.py:625:initialize_distributed] [rank: 3] Distributed initialization is done.
[2024-04-06 08:22:47,959] [INFO] [worker.py:33:worker_main] Instantiating pipelines...
[2024-04-06 08:22:48,178] [INFO] [instantiator.py:199:get_best_execution_plan] Best execution plan: 2 x <oobleck.PipelineTemplate.2nodes> pipelines (b: 6)
B: 12
[2024-04-06 08:22:54,587] [INFO] [engine.py:64:_reconfiguration_listener_fn] ReconfigureEngine: start reconfigure listening
[2024-04-06 08:22:54,588] [INFO] [worker.py:35:worker_main] Begin training...
Process SpawnProcess-1:
Traceback (most recent call last):
File "/root/miniconda3/envs/oobleck/lib/python3.10/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/root/miniconda3/envs/oobleck/lib/python3.10/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/workspace/Oobleck/oobleck/elastic/worker.py", line 36, in worker_main
engine.train()
File "/workspace/Oobleck/oobleck/execution/engine.py", line 715, in train
self._train_step()
File "/workspace/Oobleck/oobleck/utils/timer.py", line 15, in wrapper
result = func(s, *args, **kwargs)
File "/workspace/Oobleck/oobleck/execution/engine.py", line 677, in _train_step
self._dp_engine.do_allreduce()
File "/workspace/Oobleck/oobleck/execution/engine.py", line 433, in do_allreduce
layer.reduce_gradients(process_groups)
File "/workspace/Oobleck/oobleck/execution/layer.py", line 296, in reduce_gradients
torch.distributed.all_reduce(tensor=grad, group=process_group)
File "/root/miniconda3/envs/oobleck/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1436, in wrapper
return func(*args, **kwargs)
File "/root/miniconda3/envs/oobleck/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 1687, in all_reduce
work = group.allreduce([tensor], opts)
RuntimeError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
Exception raised from recvBytes at /opt/conda/conda-bld/pytorch_1678411187366/work/torch/csrc/distributed/c10d/Utils.hpp:604 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f21e191c4d7 in /root/miniconda3/envs/oobleck/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x7f21e18e6434 in /root/miniconda3/envs/oobleck/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10d::TCPStore::doWait(c10::ArrayRef<std::string>, std::chrono::duration<long, std::ratio<1l, 1000l> >) + 0xd8 (0x7f2225153248 in /root/miniconda3/envs/oobleck/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::TCPStore::doGet(std::string const&) + 0x22 (0x7f2225153ef2 in /root/miniconda3/envs/oobleck/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::TCPStore::get(std::string const&) + 0x59 (0x7f2225153f79 in /root/miniconda3/envs/oobleck/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2225113441 in /root/miniconda3/envs/oobleck/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2225113441 in /root/miniconda3/envs/oobleck/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7f2225113441 in /root/miniconda3/envs/oobleck/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #8: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xaf (0x7f21e28de6ff in /root/miniconda3/envs/oobleck/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #9: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) + 0x201 (0x7f21e28e23b1 in /root/miniconda3/envs/oobleck/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #10: <unknown function> + 0xf034dd (0x7f21e28e94dd in /root/miniconda3/envs/oobleck/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #11: c10d::ProcessGroupNCCL::allreduce_impl(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0x21 (0x7f21e28ea8e1 in /root/miniconda3/envs/oobleck/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #12: c10d::ProcessGroupNCCL::allreduce(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0x39d (0x7f21e28ed59d in /root/miniconda3/envs/oobleck/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #13: <unknown function> + 0x501d534 (0x7f2225108534 in /root/miniconda3/envs/oobleck/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x502009f (0x7f222510b09f in /root/miniconda3/envs/oobleck/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x50399b3 (0x7f22251249b3 in /root/miniconda3/envs/oobleck/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0xb6d581 (0x7f222ec8a581 in /root/miniconda3/envs/oobleck/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #17: <unknown function> + 0x3b7265 (0x7f222e4d4265 in /root/miniconda3/envs/oobleck/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #18: <unknown function> + 0x14bb15 (0x558a95093b15 in /root/miniconda3/envs/oobleck/bin/python)
frame #19: _PyObject_MakeTpCall + 0x152 (0x558a9508aa62 in /root/miniconda3/envs/oobleck/bin/python)
frame #20: <unknown function> + 0xe446c (0x558a9502c46c in /root/miniconda3/envs/oobleck/bin/python)
frame #21: _PyEval_EvalFrameDefault + 0x49c9 (0x558a95130439 in /root/miniconda3/envs/oobleck/bin/python)
frame #22: _PyFunction_Vectorcall + 0x25d (0x558a9510e1bd in /root/miniconda3/envs/oobleck/bin/python)
frame #23: PyObject_Call + 0xb8 (0x558a95091028 in /root/miniconda3/envs/oobleck/bin/python)
frame #24: _PyEval_EvalFrameDefault + 0x2c0b (0x558a9512e67b in /root/miniconda3/envs/oobleck/bin/python)
frame #25: _PyFunction_Vectorcall + 0x798 (0x558a9510e6f8 in /root/miniconda3/envs/oobleck/bin/python)
frame #26: _PyEval_EvalFrameDefault + 0x125a (0x558a9512ccca in /root/miniconda3/envs/oobleck/bin/python)
frame #27: <unknown function> + 0x1c73a5 (0x558a9510f3a5 in /root/miniconda3/envs/oobleck/bin/python)
frame #28: _PyEval_EvalFrameDefault + 0x49c9 (0x558a95130439 in /root/miniconda3/envs/oobleck/bin/python)
frame #29: _PyFunction_Vectorcall + 0x25d (0x558a9510e1bd in /root/miniconda3/envs/oobleck/bin/python)
frame #30: _PyEval_EvalFrameDefault + 0x60b (0x558a9512c07b in /root/miniconda3/envs/oobleck/bin/python)
frame #31: _PyFunction_Vectorcall + 0x25d (0x558a9510e1bd in /root/miniconda3/envs/oobleck/bin/python)
frame #32: PyObject_Call + 0x1aa (0x558a9509111a in /root/miniconda3/envs/oobleck/bin/python)
frame #33: _PyEval_EvalFrameDefault + 0x2c0b (0x558a9512e67b in /root/miniconda3/envs/oobleck/bin/python)
frame #34: _PyFunction_Vectorcall + 0x9eb (0x558a9510e94b in /root/miniconda3/envs/oobleck/bin/python)
frame #35: _PyEval_EvalFrameDefault + 0x60b (0x558a9512c07b in /root/miniconda3/envs/oobleck/bin/python)
frame #36: _PyFunction_Vectorcall + 0x25d (0x558a9510e1bd in /root/miniconda3/envs/oobleck/bin/python)
frame #37: _PyEval_EvalFrameDefault + 0x60b (0x558a9512c07b in /root/miniconda3/envs/oobleck/bin/python)
frame #38: _PyFunction_Vectorcall + 0x25d (0x558a9510e1bd in /root/miniconda3/envs/oobleck/bin/python)
frame #39: PyObject_Call + 0x1aa (0x558a9509111a in /root/miniconda3/envs/oobleck/bin/python)
frame #40: _PyEval_EvalFrameDefault + 0x2c0b (0x558a9512e67b in /root/miniconda3/envs/oobleck/bin/python)
frame #41: _PyFunction_Vectorcall + 0x25d (0x558a9510e1bd in /root/miniconda3/envs/oobleck/bin/python)
frame #42: _PyEval_EvalFrameDefault + 0x60b (0x558a9512c07b in /root/miniconda3/envs/oobleck/bin/python)
frame #43: _PyFunction_Vectorcall + 0x25d (0x558a9510e1bd in /root/miniconda3/envs/oobleck/bin/python)
frame #44: _PyEval_EvalFrameDefault + 0x60b (0x558a9512c07b in /root/miniconda3/envs/oobleck/bin/python)
frame #45: _PyFunction_Vectorcall + 0x25d (0x558a9510e1bd in /root/miniconda3/envs/oobleck/bin/python)
frame #46: _PyEval_EvalFrameDefault + 0x304 (0x558a9512bd74 in /root/miniconda3/envs/oobleck/bin/python)
frame #47: _PyFunction_Vectorcall + 0x25d (0x558a9510e1bd in /root/miniconda3/envs/oobleck/bin/python)
frame #48: _PyEval_EvalFrameDefault + 0x125a (0x558a9512ccca in /root/miniconda3/envs/oobleck/bin/python)
frame #49: <unknown function> + 0x1c51b9 (0x558a9510d1b9 in /root/miniconda3/envs/oobleck/bin/python)
frame #50: PyEval_EvalCode + 0x87 (0x558a951c0f67 in /root/miniconda3/envs/oobleck/bin/python)
frame #51: <unknown function> + 0x279029 (0x558a951c1029 in /root/miniconda3/envs/oobleck/bin/python)
frame #52: <unknown function> + 0x29ec94 (0x558a951e6c94 in /root/miniconda3/envs/oobleck/bin/python)
frame #53: PyRun_StringFlags + 0x7d (0x558a951edeed in /root/miniconda3/envs/oobleck/bin/python)
frame #54: PyRun_SimpleStringFlags + 0x3d (0x558a951edf4d in /root/miniconda3/envs/oobleck/bin/python)
frame #55: Py_RunMain + 0x26c (0x558a951ee1ec in /root/miniconda3/envs/oobleck/bin/python)
frame #56: Py_BytesMain + 0x39 (0x558a951ee469 in /root/miniconda3/envs/oobleck/bin/python)
frame #57: __libc_start_main + 0xe7 (0x7f2306953bf7 in /lib/x86_64-linux-gnu/libc.so.6)
frame #58: <unknown function> + 0x2112d1 (0x558a951592d1 in /root/miniconda3/envs/oobleck/bin/python)
. This may indicate a possible application crash on rank 0 or a network set up issue.
It does not seem to be a network connection problem, since it sometimes works and sometimes doesn't. Can you provide any guidance on how to address it? Any help would be appreciated. Many thanks!
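For reference, these are the standard PyTorch/NCCL debugging knobs I enable on each worker when it hangs. This is only a minimal sketch, nothing here is Oobleck-specific, and since Oobleck initializes the process group itself the last line is just an illustration of where a longer timeout would go:

import datetime
import os

# Standard PyTorch/NCCL debugging knobs; set them before torch.distributed is
# initialized so each rank logs its NCCL communicator setup and TCPStore waits.
os.environ["NCCL_DEBUG"] = "INFO"                 # per-rank NCCL setup/teardown logs
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # report mismatched collective calls
os.environ["TORCH_CPP_LOG_LEVEL"] = "INFO"        # surface c10d TCPStore wait details

import torch.distributed as dist

# Illustration only: assumes the launcher already set MASTER_ADDR, MASTER_PORT,
# RANK and WORLD_SIZE (the default env:// init method). A longer timeout merely
# delays the Socket Timeout if some rank never joins the process group that the
# all-reduce is waiting on.
dist.init_process_group(backend="nccl", timeout=datetime.timedelta(minutes=60))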
I find that different machines running create_pipeline_templates() with the same configuration may generate different pipeline stages, and even the same machine running create_pipeline_templates() multiple times with the same configuration may generate different pipeline stages.
For example, the first machine's run of create_pipeline_templates() generates:
pl 0: rank_grid: {0: [0], 1: [0], 2: [0], 3: [0], 4: [0], 5: [0], 6: [0], 7: [0], 8: [0], 9: [0], 10: [0], 11: [0], 12: [0], 13: [0], 14: [0], 15: [0], 16: [0], 17: [0], 18: [0], 19: [0], 20: [0], 21: [0], 22: [0], 23: [0], 24: [0], 25: [0], 26: [0], 27: [0], 28: [0], 29: [0], 30: [0], 31: [0], 32: [1], 33: [1], 34: [1], 35: [1], 36: [1], 37: [1], 38: [1], 39: [1], 40: [1], 41: [1], 42: [1], 43: [1], 44: [1], 45: [1], 46: [1], 47: [1], 48: [1], 49: [1], 50: [1], 51: [1], 52: [1], 53: [1], 54: [1], 55: [1], 56: [1], 57: [1], 58: [1], 59: [1], 60: [1], 61: [1]}, num_nodes: 2, num_gpu_per_node: 1, iter: 48419.43395149708
stage 0: layer_indices [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31], mem: 14437924864, num_gpus: 1
stage 1: layer_indices [32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61], mem: 16837231616, num_gpus: 1
while the second machine generates:
pl 0: rank_grid: {0: [0], 1: [0], 2: [0], 3: [0], 4: [0], 5: [0], 6: [0], 7: [0], 8: [0], 9: [0], 10: [0], 11: [0], 12: [0], 13: [0], 14: [0], 15: [0], 16: [0], 17: [0], 18: [0], 19: [0], 20: [0], 21: [0], 22: [0], 23: [0], 24: [0], 25: [0], 26: [0], 27: [1], 28: [1], 29: [1], 30: [1], 31: [1], 32: [1], 33: [1], 34: [1], 35: [1], 36: [1], 37: [1], 38: [1], 39: [1], 40: [1], 41: [1], 42: [1], 43: [1], 44: [1], 45: [1], 46: [1], 47: [1], 48: [1], 49: [1], 50: [1], 51: [1], 52: [1], 53: [1], 54: [1], 55: [1], 56: [1], 57: [1], 58: [1], 59: [1], 60: [1], 61: [1]}, num_nodes: 2, num_gpu_per_node: 1, iter: 53851.282695651054
stage 0: layer_indices [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26], mem: 12422262784, num_gpus: 1
stage 1: layer_indices [27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61], mem: 18852893696, num_gpus: 1
That causes the training error above, presumably because ranks that disagree on the stage boundaries also disagree on which process groups to build, so the all-reduce waits on a communicator that is never fully set up. Could you please tell me why this happens and how to address it? Thanks!
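As a workaround on my side, I added a sanity check right after torch.distributed is initialized and before pipelines are instantiated. This is only a minimal sketch (stage_boundaries is a placeholder for whatever per-rank plan was computed locally, not an actual Oobleck API); it just confirms whether all ranks derived the same plan:

import hashlib
import json

import torch.distributed as dist


def assert_same_plan(stage_boundaries: list[tuple[int, int]]) -> None:
    # Hash the locally computed stage boundaries (e.g. [(0, 31), (32, 61)])
    # and gather every rank's hash. With the NCCL backend, make sure the
    # current CUDA device is set for this rank before calling this.
    digest = hashlib.sha256(json.dumps(stage_boundaries).encode()).hexdigest()
    digests: list[str | None] = [None] * dist.get_world_size()
    dist.all_gather_object(digests, digest)
    if len(set(digests)) != 1:
        raise RuntimeError(
            f"rank {dist.get_rank()}: pipeline plans differ across ranks; "
            "every rank must compute identical stage boundaries"
        )

With a check like this the job fails immediately with a clear message instead of hanging at the first all-reduce until the TCPStore Socket Timeout fires.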
This will be addressed by a newly refactored pipeline template generation algorithm. I am finalizing all the refactoring work and preparing to release it. Please give me a few more days.
Re: the create_pipeline_templates()
issue, I believe it should be resolved now. Re: the NCCL error, Oobleck now has a more advanced error detection implementation. Although not every abrupt disconnection can be detected, most cases can be handled, especially when you use Oobleck's new feature that sends a graceful termination request. Please try it, and feel free to reopen this issue or create a new one if you face another problem. Thanks.
Thanks, that works.