Error: "Backward sending grads, but get None"
wfckl789 opened this issue · 1 comments
Hi, I'm encountering the error `Backward sending grads, but get None`, raised by `_bwd_postprocess_task()` during model training. It seems that the tensor loses its `requires_grad` property after passing through this line in `src/neuronx_distributed/pipeline/comm.py`:
`tensor_recv_next = xm.all_reduce(xm.REDUCE_SUM, tensor_recv_next, groups=groups)`
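For anyone trying to narrow this down, here is a minimal CPU-only sketch of the symptom I suspect. It does not call `xm.all_reduce` or any neuronx-distributed code. The `detach()` call is only a stand-in I chose to imitate an op that drops autograd tracking, and `describe_grad_state`, `recv`, and `w` are hypothetical names. It shows how to inspect `requires_grad` and `grad_fn` around a suspect op, and that the received tensor ends up with no gradient to send back.

```python
import torch


def describe_grad_state(name, t):
    """Return a one-line summary of a tensor's autograd state."""
    grad_fn = type(t.grad_fn).__name__ if t.grad_fn is not None else None
    return f"{name}: requires_grad={t.requires_grad}, is_leaf={t.is_leaf}, grad_fn={grad_fn}"


# Stand-in for an activation received from the previous pipeline stage.
recv = torch.ones(4, requires_grad=True)

# Normal differentiable op: autograd tracking is preserved.
tracked = recv * 2

# ASSUMPTION: detach() imitates an op that loses autograd tracking.
# This is NOT what xm.all_reduce does internally; it only mimics the symptom.
after_op = tracked.detach()

# Stand-in for a local parameter of this stage, so backward() can still run.
w = torch.ones(4, requires_grad=True)
loss = (after_op * w).sum()
loss.backward()

print(describe_grad_state("tracked", tracked))
print(describe_grad_state("after_op", after_op))
# The local parameter receives a gradient, but the received tensor does not,
# so there would be nothing to send to the previous stage.
print("w.grad is None:", w.grad is None)
print("recv.grad is None:", recv.grad is None)
```

Printing the same summary right before and after the real `all_reduce` call in `comm.py` should confirm or rule out this guess.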
This error also occurs when I run the tutorial "Training Llama-2-13B/70B with Tensor Parallelism and Pipeline Parallelism (neuronx-distributed)" from the Neuron documentation.
This is the log and compiler info:
simple.log
`2024-04-01 06:59:57.748428: W torch_xla/csrc/lowering_context.cpp:71] No custom opname metadata! op_type=xla___op_TransferWithStaticRingTransfer
File "/home/ubuntu/qwb_venv_pytorch/lib/python3.8/site-packages/neuronx_distributed/pipeline/model.py", line 982, in _exec_schedule
self._exec_instr()
File "/home/ubuntu/qwb_venv_pytorch/lib/python3.8/site-packages/neuronx_distributed/pipeline/model.py", line 920, in _bwd_postprocess_task
raise RuntimeError(rmsg("Backward sending grads, but get None"))
RuntimeError: [rank_8_pp1_tp0_dp0] Backward sending grads, but get None
Traceback (most recent call last):
File "run_simple_model_nxd.py", line 289, in <module>
_mp_fn(0, args)
File "run_simple_model_nxd.py", line 225, in _mp_fn
train_simple_model(args)
File "run_simple_model_nxd.py", line 188, in train_simple_model
loss = model.run_train(
File "/home/ubuntu/qwb_venv_pytorch/lib/python3.8/site-packages/neuronx_distributed/trainer/model.py", line 25, in run_train
return self.module.run_train(*args, **kwargs)
File "/home/ubuntu/qwb_venv_pytorch/lib/python3.8/site-packages/neuronx_distributed/pipeline/model.py", line 542, in run_train
loss = self._run_train(**kwargs)
File "/home/ubuntu/qwb_venv_pytorch/lib/python3.8/site-packages/neuronx_distributed/pipeline/model.py", line 561, in _run_train
self._exec_schedule(self.train_scheduler)
File "/home/ubuntu/qwb_venv_pytorch/lib/python3.8/site-packages/neuronx_distributed/pipeline/model.py", line 982, in _exec_schedule
self._exec_instr()
File "/home/ubuntu/qwb_venv_pytorch/lib/python3.8/site-packages/neuronx_distributed/pipeline/model.py", line 920, in _bwd_postprocess_task
raise RuntimeError(rmsg("Backward sending grads, but get None"))
RuntimeError: [rank_24_pp3_tp0_dp0] Backward sending grads, but get None`
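To see which workers fail, the rank tags in the log can be pulled out with a small script. This is only a sketch: the tag layout `[rank_N_ppN_tpN_dpN]` is inferred from the two `RuntimeError` lines above, reading `pp`/`tp`/`dp` as pipeline-, tensor-, and data-parallel ranks is my assumption, and `failing_ranks` is a hypothetical helper name.

```python
import re

# Tag layout inferred from the log above, e.g. [rank_8_pp1_tp0_dp0].
RANK_TAG = re.compile(r"\[rank_(\d+)_pp(\d+)_tp(\d+)_dp(\d+)\]")


def failing_ranks(log_text):
    """Collect the distinct rank tags found on RuntimeError lines."""
    found = []
    for line in log_text.splitlines():
        if "RuntimeError" not in line:
            continue
        match = RANK_TAG.search(line)
        if match is None:
            # e.g. the "raise RuntimeError(...)" source line carries no tag
            continue
        rank, pp, tp, dp = map(int, match.groups())
        entry = {"rank": rank, "pp": pp, "tp": tp, "dp": dp}
        if entry not in found:
            found.append(entry)
    return found


sample = """raise RuntimeError(rmsg("Backward sending grads, but get None"))
RuntimeError: [rank_8_pp1_tp0_dp0] Backward sending grads, but get None
RuntimeError: [rank_24_pp3_tp0_dp0] Backward sending grads, but get None"""

ranks = failing_ranks(sample)
for entry in ranks:
    print(entry)
```

In this excerpt both failing workers report a non-zero `pp` value (1 and 3).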
Other system details:
instance: Trn1
OS: Ubuntu 20.04
If you need any other information, please let me know. Thanks.
I am closing this issue in favor of:
aws-neuron/neuronx-distributed#19
We will track that ticket until resolution instead.