How to implement multi-node FSDP
boyue-jiang opened this issue · 1 comment
boyue-jiang commented
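Hello, I saw in the paper that you trained on 32 GPUs during the pretraining stage. How can I run multi-node training with the Trainer's FSDP support? For example, if I want to train on 2 nodes with 16 A100s in total, how should I set this up with the Trainer, and will the model be sharded across all 16 GPUs?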
fwyc0573 commented
I have run into a similar problem. I want to use PyTorch's FSDP to train across multiple nodes, but the process hangs. Is there a configuration or example I can follow?
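
For reference, here is a minimal sketch of what a 2-node, 16-GPU launch could look like. It is not the authors' setup. It assumes the job is started with `torchrun` using static rendezvous, that the training script is called `train.py` (hypothetical name), that it parses Hugging Face `TrainingArguments` so the `--fsdp "full_shard auto_wrap"` option applies, and that `10.0.0.1` is a placeholder address for node 0. The same command runs once on each node, with only `--node_rank` changed. With `full_shard`, parameters should be sharded across every rank in the job, so across all 16 GPUs. A hang at startup is often a rendezvous problem (the master address or port is not reachable from the other node, or the node ranks do not match), though other causes are possible.

```python
import shlex


def torchrun_command(node_rank, nnodes=2, gpus_per_node=8,
                     master_addr="10.0.0.1", master_port=29500,
                     script="train.py"):
    """Build the launch command for one node; run it once on every node.

    master_addr and script are placeholders (assumptions), not values
    taken from this repository.
    """
    args = [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={gpus_per_node}",
        f"--node_rank={node_rank}",        # 0 on the first node, 1 on the second
        f"--master_addr={master_addr}",    # must be reachable from every node
        f"--master_port={master_port}",
        script,
        "--fsdp", "full_shard auto_wrap",  # Trainer option, assuming TrainingArguments
    ]
    return shlex.join(args)


def global_rank(node_rank, local_rank, gpus_per_node=8):
    """Global rank of a worker under static rendezvous with equal GPUs per node."""
    return node_rank * gpus_per_node + local_rank


if __name__ == "__main__":
    # Print the command to run on each of the two nodes.
    for node in range(2):
        print(torchrun_command(node))
```

Both nodes need the same `--master_addr` and `--master_port`, and the world size the Trainer sees is `nnodes * nproc_per_node`, which is 16 here.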