Training 70b model
Yummy416 opened this issue · 1 comment
If I want to SFT a 70b model, how should I set the parameters? For full-parameter training, your framework currently requires me to allocate a batch size of at least 1 to each GPU. What if I want multiple GPUs to share a single batch?
Thanks for your interest in LMFlow! Currently, the batch size is required to be at least 1 per GPU. The extra data this implies should not add much memory cost.
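For a rough sense of how the per-GPU minimum composes with data parallelism and gradient accumulation, here is a minimal arithmetic sketch (the variable names are illustrative, not LMFlow arguments):

```python
# Illustrative arithmetic only; variable names are hypothetical, not LMFlow flags.
num_gpus = 8                       # GPUs in the server
per_device_train_batch_size = 1    # current minimum of 1 sample per GPU
gradient_accumulation_steps = 16   # accumulate micro-batches to enlarge the effective batch

effective_batch_size = (
    num_gpus * per_device_train_batch_size * gradient_accumulation_steps
)
print(effective_batch_size)  # 128
```

In other words, even with the 1-sample-per-GPU floor, the effective batch size is controlled by the number of GPUs and the accumulation steps rather than by raising the per-device value.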
In our experience, the major memory cost comes from the model itself, which is already handled by LMFlow's model parallelism mechanism and split into smaller parts across GPUs. Considering the overhead, we recommend a GPU server with at least 2.5 TB of total memory (RAM + GPU memory) to run full fine-tuning of the 70b model, or using multi-node training or LoRA training instead. Hope that helps 🙏
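To see why a figure on the order of 2.5 TB is suggested, a back-of-the-envelope sketch of the model and optimizer state (assumptions: bf16 weights and gradients, fp32 master weights, fp32 Adam moments; activations and framework overhead excluded; not LMFlow-specific numbers):

```python
# Rough memory estimate for full fine-tuning a 70B-parameter model.
params = 70e9
bytes_per_param = 2 + 2 + 4 + 4 + 4   # bf16 weights, bf16 grads, fp32 master weights, Adam m, Adam v
model_state_tb = params * bytes_per_param / 1e12
print(f"~{model_state_tb:.2f} TB of model/optimizer state")  # ~1.12 TB
```

Activations, communication buffers, and general runtime overhead come on top of this, which is why the recommended total memory above is comfortably larger than the raw state size.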