[Bug] When `--checkpoint 0` is used, the system raises RuntimeError: Tensors must be contiguous
zarzen opened this issue · 1 comment
zarzen commented
To reproduce
deepspeed ~/slapo/examples/gpt/deepspeed_hf.py \
--model_name EleutherAI/gpt-neo-1.3B \
--seq_len 1024 \
--disable_pipeline \
--batch_size 8 \
--iter_nums 40 \
--checkpoint 0
Error Messages
Traceback (most recent call last):
File "/home/zhzhn/slapo/examples/gpt/deepspeed_hf.py", line 320, in <module>
train(args)
File "/home/zhzhn/slapo/examples/gpt/deepspeed_hf.py", line 185, in train
init_weights=model._init_weights,
File "/home/zhzhn/slapo/slapo/schedule.py", line 1158, in build
return init_target_engine(model, target, **kwargs)
File "/home/zhzhn/slapo/slapo/schedule.py", line 1125, in init_target_engine
**kwargs,
File "/home/zhzhn/slapo/slapo/model_dialect/deepspeed/engine.py", line 36, in init_ds_engine
mpu=mpu,
File "/fsx/zhzhn/ZeRO-2D/deepspeed/__init__.py", line 130, in initialize
config_params=config_params)
File "/fsx/zhzhn/ZeRO-2D/deepspeed/runtime/engine.py", line 262, in __init__
self._configure_distributed_model(model)
File "/fsx/zhzhn/ZeRO-2D/deepspeed/runtime/engine.py", line 1052, in _configure_distributed_model
self._broadcast_model()
File "/fsx/zhzhn/ZeRO-2D/deepspeed/runtime/engine.py", line 962, in _broadcast_model
group=self.data_parallel_group)
File "/usr/local/lib64/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1404, in broadcast
work = group.broadcast([tensor], opts)
RuntimeError: Tensors must be contiguous
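For context, `torch.distributed.broadcast` rejects non-contiguous tensors, and a transposed weight view is the typical way such a tensor shows up. A minimal sketch, independent of the Slapo/DeepSpeed code path:

```python
# Minimal sketch of the failure mode (not the Slapo code path): a transposed
# weight is a non-contiguous view, and broadcasting it raises
# "RuntimeError: Tensors must be contiguous".
import torch

w = torch.empty(2048, 1024)
w_t = w.t()                               # transpose returns a view with swapped strides
print(w.is_contiguous())                  # True
print(w_t.is_contiguous())                # False -> dist.broadcast would reject it
print(w_t.contiguous().is_contiguous())   # True after materializing a copy
```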
Other observations
When `--checkpoint 1` is used, the error disappears.
# the following command works
deepspeed ~/slapo/examples/gpt/deepspeed_hf.py \
--model_name EleutherAI/gpt-neo-1.3B \
--seq_len 1024 \
--disable_pipeline \
--batch_size 8 \
--iter_nums 40 \
--checkpoint 1
Logging the `is_contiguous` property from the DS runtime side: based on the following log, the `is_contiguous` property of the output MLP layer's weight (`mlp.fc_out.weight`) is `False` when `--checkpoint 0` is used.
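The log was produced by printing each parameter's name, shape, and contiguity inside `_broadcast_model` (`engine.py:952`). A standalone sketch of equivalent logging (the helper name is hypothetical; the real change lives inside the DeepSpeed engine):

```python
# Standalone equivalent of the debug logging added in _broadcast_model:
# iterate the model's parameters and report shape and contiguity.
import torch

def log_param_contiguity(model: torch.nn.Module, dp_size: int = 1) -> None:
    for name, p in model.named_parameters():
        print(
            f"name {name}, dp size {dp_size}, "
            f"tensor shape {p.shape} p is_contiguous {p.is_contiguous()}"
        )
```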
[2023-02-02 19:24:11,016] [INFO] [engine.py:952:_broadcast_model] name transformer.h.21.mlp.act.bias, dp size 1, tensor shape torch.Size([1024]) p is_contiguous True
[2023-02-02 19:24:11,016] [INFO] [engine.py:952:_broadcast_model] name transformer.h.21.mlp.fc_out.weight, dp size 1, tensor shape torch.Size([2048, 1024]) p is_contiguous False
[2023-02-02 19:24:11,016] [INFO] [engine.py:952:_broadcast_model] name transformer.h.21.mlp.fc_out.bias, dp size 1, tensor shape torch.Size([2048]) p is_contiguous True
[2023-02-02 19:24:11,016] [INFO] [engine.py:952:_broadcast_model] name transformer.h.22.ln_1.weight, dp size 1, tensor shape torch.Size([2048]) p is_contiguous True
[2023-02-02 19:24:11,016] [INFO] [engine.py:952:_broadcast_model] name transformer.h.22.ln_1.bias, dp size 1, tensor shape torch.Size([2048]) p is_contiguous True
[2023-02-02 19:24:11,016] [INFO] [engine.py:952:_broadcast_model] name transformer.h.22.attn.attention.module.out_proj.weight, dp size 1, tensor shape torch.Size([2048, 256]) p is_contiguous True
[2023-02-02 19:24:11,016] [INFO] [engine.py:952:_broadcast_model] name transformer.h.22.attn.attention.module.out_proj.bias, dp size 1, tensor shape torch.Size([2048]) p is_contiguous True
[2023-02-02 19:24:11,016] [INFO] [engine.py:952:_broadcast_model] name transformer.h.22.attn.attention.module.FusedQKV_0.fused_linear.weight, dp size 1, tensor shape torch.Size([768, 2048]) p is_contiguous True
[2023-02-02 19:24:11,016] [INFO] [engine.py:952:_broadcast_model] name transformer.h.22.attn.attention.module.FusedQKV_0.fused_linear.bias, dp size 1, tensor shape torch.Size([768]) p is_contiguous True
[2023-02-02 19:24:11,016] [INFO] [engine.py:952:_broadcast_model] name transformer.h.22.ln_2.weight, dp size 1, tensor shape torch.Size([2048]) p is_contiguous True
[2023-02-02 19:24:11,016] [INFO] [engine.py:952:_broadcast_model] name transformer.h.22.ln_2.bias, dp size 1, tensor shape torch.Size([2048]) p is_contiguous True
[2023-02-02 19:24:11,016] [INFO] [engine.py:952:_broadcast_model] name transformer.h.22.mlp.fc_in.weight, dp size 1, tensor shape torch.Size([1024, 2048]) p is_contiguous True
[2023-02-02 19:24:11,016] [INFO] [engine.py:952:_broadcast_model] name transformer.h.22.mlp.act.bias, dp size 1, tensor shape torch.Size([1024]) p is_contiguous True
[2023-02-02 19:24:11,016] [INFO] [engine.py:952:_broadcast_model] name transformer.h.22.mlp.fc_out.weight, dp size 1, tensor shape torch.Size([2048, 1024]) p is_contiguous False
[2023-02-02 19:24:11,016] [INFO] [engine.py:952:_broadcast_model] name transformer.h.22.mlp.fc_out.bias, dp size 1, tensor shape torch.Size([2048]) p is_contiguous True
[2023-02-02 19:24:11,016] [INFO] [engine.py:952:_broadcast_model] name transformer.h.23.ln_1.weight, dp size 1, tensor shape torch.Size([2048]) p is_contiguous True
[2023-02-02 19:24:11,016] [INFO] [engine.py:952:_broadcast_model] name transformer.h.23.ln_1.bias, dp size 1, tensor shape torch.Size([2048]) p is_contiguous True
[2023-02-02 19:24:11,016] [INFO] [engine.py:952:_broadcast_model] name transformer.h.23.attn.attention.module.out_proj.weight, dp size 1, tensor shape torch.Size([2048, 256]) p is_contiguous True
[2023-02-02 19:24:11,016] [INFO] [engine.py:952:_broadcast_model] name transformer.h.23.attn.attention.module.out_proj.bias, dp size 1, tensor shape torch.Size([2048]) p is_contiguous True
[2023-02-02 19:24:11,016] [INFO] [engine.py:952:_broadcast_model] name transformer.h.23.attn.attention.module.FusedQKV_0.fused_linear.weight, dp size 1, tensor shape torch.Size([768, 2048]) p is_contiguous True
[2023-02-02 19:24:11,016] [INFO] [engine.py:952:_broadcast_model] name transformer.h.23.attn.attention.module.FusedQKV_0.fused_linear.bias, dp size 1, tensor shape torch.Size([768]) p is_contiguous True
[2023-02-02 19:24:11,016] [INFO] [engine.py:952:_broadcast_model] name transformer.h.23.ln_2.weight, dp size 1, tensor shape torch.Size([2048]) p is_contiguous True
[2023-02-02 19:24:11,016] [INFO] [engine.py:952:_broadcast_model] name transformer.h.23.ln_2.bias, dp size 1, tensor shape torch.Size([2048]) p is_contiguous True
[2023-02-02 19:24:11,016] [INFO] [engine.py:952:_broadcast_model] name transformer.h.23.mlp.fc_in.weight, dp size 1, tensor shape torch.Size([1024, 2048]) p is_contiguous True
[2023-02-02 19:24:11,016] [INFO] [engine.py:952:_broadcast_model] name transformer.h.23.mlp.act.bias, dp size 1, tensor shape torch.Size([1024]) p is_contiguous True
[2023-02-02 19:24:11,016] [INFO] [engine.py:952:_broadcast_model] name transformer.h.23.mlp.fc_out.weight, dp size 1, tensor shape torch.Size([2048, 1024]) p is_contiguous False
[2023-02-02 19:24:11,016] [INFO] [engine.py:952:_broadcast_model] name transformer.h.23.mlp.fc_out.bias, dp size 1, tensor shape torch.Size([2048]) p is_contiguous True
[2023-02-02 19:24:11,016] [INFO] [engine.py:952:_broadcast_model] name transformer.ln_f.weight, dp size 1, tensor shape torch.Size([2048]) p is_contiguous True
[2023-02-02 19:24:11,017] [INFO] [engine.py:952:_broadcast_model] name transformer.ln_f.bias, dp size 1, tensor shape torch.Size([2048]) p is_contiguous True
[2023-02-02 19:24:11,017] [INFO] [engine.py:952:_broadcast_model] name lm_head.weight, dp size 1, tensor shape torch.Size([6283, 2048]) p is_contiguous True
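As a possible workaround (an assumption on my side, not the actual fix from the PRs mentioned below), the offending parameters can be made contiguous before `deepspeed.initialize` is called:

```python
# Possible workaround (an assumption, not the upstream fix): make every parameter
# contiguous before handing the model to deepspeed.initialize, so that
# _broadcast_model only ever broadcasts contiguous tensors.
import torch

def make_params_contiguous(model: torch.nn.Module) -> None:
    with torch.no_grad():
        for _, p in model.named_parameters():
            if not p.is_contiguous():
                p.data = p.data.contiguous()  # e.g. transformer.h.*.mlp.fc_out.weight
```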
zarzen commented
Shown to be fixed by recent PRs