aws-neuron/aws-neuron-sdk

Error: "Backward sending grads, but get None"

wfckl789 opened this issue · 1 comments

Hi, I'm encountering the error `Backward sending grads, but get None`, raised by `_bwd_postprocess_task()` during model training. It seems that the tensor loses its `requires_grad` property after passing through this line in `src/neuronx_distributed/pipeline/comm.py`: `tensor_recv_next = xm.all_reduce(xm.REDUCE_SUM, tensor_recv_next, groups=groups)`.
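To make the suspected symptom concrete, here is a minimal sketch in plain PyTorch. It does not call `torch_xla` or `xm.all_reduce` and does not reproduce the failure itself; the `torch.no_grad()` block is only a hypothetical stand-in for any operation that autograd does not track. It shows how a tensor produced that way ends up with `requires_grad=False` and no `grad_fn`, which is the state described above.

```python
import torch


def autograd_state(t):
    """Return (requires_grad, has_grad_fn) for a tensor."""
    return t.requires_grad, t.grad_fn is not None


x = torch.ones(4, requires_grad=True)

# A differentiable op keeps the result attached to the autograd graph.
attached = x * 2

# Hypothetical stand-in for an op that autograd does not track.
# This is NOT xm.all_reduce; it only imitates the reported symptom locally.
with torch.no_grad():
    detached = x * 2

print("attached:", autograd_state(attached))
print("detached:", autograd_state(detached))

# Gradients flow back to x only through the attached result.
attached.sum().backward()
print("x.grad is None:", x.grad is None)
```

Printing the same two properties for `tensor_recv_next` immediately before and after the `xm.all_reduce` call would be one way to confirm or rule out this explanation.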

This error also happens when I run the demo "Training Llama-2-13B/70B with Tensor Parallelism and Pipeline Parallelism (neuronx-distributed)" from the Neuron documentation.

This is the log and compiler info:
simple.log

`2024-04-01 06:59:57.748428: W torch_xla/csrc/lowering_context.cpp:71] No custom opname metadata! op_type=xla___op_TransferWithStaticRingTransfer
  File "/home/ubuntu/qwb_venv_pytorch/lib/python3.8/site-packages/neuronx_distributed/pipeline/model.py", line 982, in _exec_schedule
    self._exec_instr()
  File "/home/ubuntu/qwb_venv_pytorch/lib/python3.8/site-packages/neuronx_distributed/pipeline/model.py", line 920, in _bwd_postprocess_task
    raise RuntimeError(rmsg("Backward sending grads, but get None"))
RuntimeError: [rank_8_pp1_tp0_dp0] Backward sending grads, but get None
Traceback (most recent call last):
  File "run_simple_model_nxd.py", line 289, in <module>
    _mp_fn(0, args)
  File "run_simple_model_nxd.py", line 225, in _mp_fn
    train_simple_model(args)
  File "run_simple_model_nxd.py", line 188, in train_simple_model
    loss = model.run_train(
  File "/home/ubuntu/qwb_venv_pytorch/lib/python3.8/site-packages/neuronx_distributed/trainer/model.py", line 25, in run_train
    return self.module.run_train(*args, **kwargs)
  File "/home/ubuntu/qwb_venv_pytorch/lib/python3.8/site-packages/neuronx_distributed/pipeline/model.py", line 542, in run_train
    loss = self._run_train(**kwargs)
  File "/home/ubuntu/qwb_venv_pytorch/lib/python3.8/site-packages/neuronx_distributed/pipeline/model.py", line 561, in _run_train
    self._exec_schedule(self.train_scheduler)
  File "/home/ubuntu/qwb_venv_pytorch/lib/python3.8/site-packages/neuronx_distributed/pipeline/model.py", line 982, in _exec_schedule
    self._exec_instr()
  File "/home/ubuntu/qwb_venv_pytorch/lib/python3.8/site-packages/neuronx_distributed/pipeline/model.py", line 920, in _bwd_postprocess_task
    raise RuntimeError(rmsg("Backward sending grads, but get None"))
RuntimeError: [rank_24_pp3_tp0_dp0] Backward sending grads, but get None`
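The two `RuntimeError` lines carry a rank tag that shows which pipeline stages failed. A small sketch for pulling those tags out of a log such as `simple.log` follows. The tag layout `[rank_<global>_pp<stage>_tp<rank>_dp<rank>]` is an assumption inferred only from the two lines above, and `failing_ranks` is a hypothetical helper, not part of neuronx-distributed.

```python
import re

# Assumed tag layout, inferred from the two RuntimeError lines in this log:
# [rank_<global>_pp<stage>_tp<rank>_dp<rank>]
TAG = re.compile(r"\[rank_(\d+)_pp(\d+)_tp(\d+)_dp(\d+)\]")


def failing_ranks(log_text, needle="Backward sending grads, but get None"):
    """Return (global_rank, pp, tp, dp) for each tagged line containing needle."""
    found = []
    for line in log_text.splitlines():
        if needle not in line:
            continue
        match = TAG.search(line)
        if match:  # skips the untagged `raise RuntimeError(...)` source lines
            found.append(tuple(int(g) for g in match.groups()))
    return found


sample = """\
    raise RuntimeError(rmsg("Backward sending grads, but get None"))
RuntimeError: [rank_8_pp1_tp0_dp0] Backward sending grads, but get None
RuntimeError: [rank_24_pp3_tp0_dp0] Backward sending grads, but get None
"""

ranks = failing_ranks(sample)
stages = sorted({pp for _, pp, _, _ in ranks})
print(ranks)   # [(8, 1, 0, 0), (24, 3, 0, 0)]
print(stages)  # [1, 3]
```

In this excerpt the error is reported by ranks on pipeline stages 1 and 3; running the helper over the full log would show whether other stages are affected as well.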

Package version: (three screenshots attached)

Other system details:
instance: Trn1
OS: Ubuntu 20.04

If you need any other information, please let me know. Thanks.

I am closing this issue in favor of:
aws-neuron/neuronx-distributed#19
We will track that ticket until resolution instead.