How to use multiple GPUs for model parallel training
zhihui-shao opened this issue · 5 comments
Hi, will you release a method for model-parallel training across multiple GPUs?
Hey, I'm not the author, but I use Accelerate with a DeepSpeed config and without FP16. Under the hood it uses the HF Trainer.
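For reference, a minimal sketch of that kind of setup with the HF Trainer is below. The model name, dataset, `ds_config.json` path, and hyperparameters are placeholder assumptions, not something specified in this thread:

```python
# Minimal sketch (assumptions, not this repo's documented setup):
# HF Trainer + a DeepSpeed config, with FP16 left disabled.
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in model for illustration

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=4,
    deepspeed="ds_config.json",  # path to a DeepSpeed config (e.g. ZeRO stage 2)
    fp16=False,                  # keep full precision, as noted above
)

# train_dataset is assumed to be a prepared dataset of tokenized examples.
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```

You would then launch it with `accelerate launch` or the `deepspeed` launcher as usual.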
@infosechoudini Thanks for chiming in. Yes, the purpose of this repo is to make it HuggingFace compatible, so please do try HF Trainer :)
I've had a few issues with FP16 (numerical stability), which is also noted in the official implementation, so I would stick to FP32 for now.
I am working on porting the official implementation to HF; it is almost finished except for the chunkwise forward and just needs a few tests and some debugging. It has some tricks for stability, which may enable FP16 :)
YAY for FP16!! I've been working on it on my side as well... no luck tho
Update: The official code updates are now in main. It is on par with the original implementation in terms of forward, weight naming, and backward gradients :) (check `tests/`). One thing is that I would recommend `bf16` over `fp16`, since I personally tested with `bf16` only and can confirm that it is stable. I can also confirm that this model can be trained stably with parallelism such as `fsdp`.
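For anyone trying this, a minimal sketch of a bf16 + FSDP run through the HF Trainer follows. The FSDP options and hyperparameters are assumptions on my part, not settings from this comment:

```python
# Minimal sketch (assumptions, not this repo's documented setup):
# bf16 mixed precision + PyTorch FSDP via the HF Trainer.
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=4,
    bf16=True,                    # bf16 rather than fp16, as recommended above
    fsdp="full_shard auto_wrap",  # shard parameters/grads/optimizer state and auto-wrap modules
)

# model and train_dataset are assumed to be prepared elsewhere.
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```

Launched across the available GPUs with `torchrun` or `accelerate launch`.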
Thanks!!!! You're awesome!