BlinkDL/RWKV-LM

Question: error when training RWKV-4-Pile-3B-20221008-8023

XxSuper opened this issue · 3 comments

torch==2.0.0+cu118 torchvision==0.15.1+cu118 torchaudio==2.0.1+cu118
deepspeed 0.12.4
pytorch-lightning 2.1.2
The error reported is:
AttributeError: 'MyDataset' object has no attribute 'global_rank'

IMPORTANT: Use deepspeed==0.7.0 pytorch-lightning==1.9.2 torch 1.13.1+cu117
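
A quick way to confirm that an environment matches the versions recommended above is to compare the installed packages against them. The following is only a minimal sketch: the helper names `base_version` and `check_versions` are hypothetical and not part of RWKV-LM, and it assumes that local build tags such as `+cu117` can be ignored for the comparison.

```python
from importlib.metadata import version, PackageNotFoundError

# Versions recommended in this thread (local build tags such as +cu117 are ignored).
RECOMMENDED = {
    "deepspeed": "0.7.0",
    "pytorch-lightning": "1.9.2",
    "torch": "1.13.1",
}


def base_version(v):
    """Drop a local build tag, e.g. '1.13.1+cu117' -> '1.13.1'."""
    return v.split("+", 1)[0]


def check_versions(recommended=RECOMMENDED, get_version=version):
    """Return {package: (installed_or_None, wanted)} for every mismatch."""
    mismatches = {}
    for pkg, wanted in recommended.items():
        try:
            installed = get_version(pkg)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed is None or base_version(installed) != wanted:
            mismatches[pkg] = (installed, wanted)
    return mismatches


if __name__ == "__main__":
    for pkg, (installed, wanted) in check_versions().items():
        print(f"{pkg}: installed {installed}, recommended {wanted}")
```

If the script prints nothing, the three packages match the recommended versions; otherwise each mismatching package is listed with its installed and recommended version.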

Thanks for the guidance; the problem above is solved. However, training now crashes when running on two GPUs. What could be causing this?
Loading extension module fused_adam...
Time to load fused_adam op: 2.221088409423828 seconds
Loading extension module fused_adam...
Time to load fused_adam op: 2.2072091102600098 seconds
/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead
warnings.warn(
/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py:429: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead
warnings.warn(
Bus error (core dumped)

I need to see the specific error. Please post the full output.