syncdoth/RetNet

How to use multiple GPUs for model parallel training

zhihui-shao opened this issue · 5 comments

Hi, will you release a method for model-parallel training across multiple GPUs?

Hey, I'm not the author, but I use an Accelerate and DeepSpeed config without FP16. Under the hood it uses the HF Trainer.
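For anyone wanting to reproduce this setup, here is a minimal sketch of a DeepSpeed config with FP16 explicitly disabled, to be handed to the HF Trainer. The ZeRO stage and the `"auto"` placeholders are my choices, not something prescribed by this repo; the HF integration fills `"auto"` values in from `TrainingArguments`.

```python
import json

# Minimal DeepSpeed config sketch: ZeRO stage 2, FP16 explicitly disabled.
# Values set to "auto" are resolved by the HF Trainer's DeepSpeed integration.
ds_config = {
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
    "zero_optimization": {
        "stage": 2,                  # shard optimizer state and gradients
        "overlap_comm": True,
        "contiguous_gradients": True,
    },
    "fp16": {"enabled": False},      # FP16 is numerically unstable here
    "bf16": {"enabled": False},      # keep pure FP32 for now
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)

# Then point the Trainer at it, e.g.:
#   TrainingArguments(..., deepspeed="ds_config.json")
# and launch with `accelerate launch train.py` or `deepspeed train.py`.
```

The JSON file is the only DeepSpeed-specific artifact; the rest of the training script stays plain HF Trainer code.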

@infosechoudini Thanks for chiming in. Yes, the purpose of this repo is to make it HuggingFace compatible, so please do try HF Trainer :)

I've had a few issues with FP16 (numerical stability), which is also noted in the official implementation, so I would stick to FP32 for now.

I am working on porting the official implementation to HF. It is almost finished except for the chunkwise forward, and just needs a few tests and some debugging. It has some tricks for stability, which may enable FP16 :)

YAY for FP16!! I've been working on it on my side as well... no luck tho

Update: the official code updates are now in main. It is on par with the original implementation in terms of forward outputs, weight naming, and backward gradients :) (check tests/). One note: I would recommend bf16 over fp16, since I have personally tested with bf16 only and can confirm that it is stable.

I can also confirm that this model trains stably with parallelism such as FSDP.
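For reference, FSDP can be enabled directly through the HF Trainer. Below is a sketch of the relevant `TrainingArguments` settings as a plain dict; the key names follow `transformers.TrainingArguments`, but `"RetNetDecoderLayer"` is an assumed layer-class name — check the repo's modeling code for the actual class to wrap.

```python
# Sketch of FSDP settings for the HF Trainer. "RetNetDecoderLayer" is an
# assumption -- substitute the real decoder-layer class from this repo.
fsdp_args = {
    "fsdp": "full_shard auto_wrap",  # shard params, grads, optimizer state
    "fsdp_config": {
        "transformer_layer_cls_to_wrap": ["RetNetDecoderLayer"],
    },
    "bf16": True,                    # bf16 pairs well with FSDP per this thread
}

# In real training code this would be spliced into TrainingArguments, e.g.:
#   args = TrainingArguments(output_dir="out", **fsdp_args)
#   Trainer(model=model, args=args, train_dataset=train_ds).train()
# and launched with `torchrun --nproc_per_node=8 train.py`.
```

Unlike DeepSpeed, this needs no separate config file: FSDP sharding is configured entirely through the training arguments.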

Thanks!!!! You're awesome!