S3DP (Simple 3D Parallel) implements data parallelism, tensor parallelism, and pipeline parallelism for efficient distributed LLM training. Inspired by Megatron-LM, it aims to be a simple, minimal framework that helps people with no distributed-training experience understand the basics.
- Sticks to native PyTorch and Hugging Face APIs as much as possible.
- Implements only the core functions of 3D parallelism, keeping the code simple and efficient (see the sketch below).
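At its core, 3D parallelism boils down to giving every rank three `torch.distributed` process groups: one for tensor parallelism, one for pipeline parallelism, and one for data parallelism. The sketch below is a rough illustration of how these groups can be built with only native PyTorch calls; it is not this repo's actual code, and the function name and the assumed rank layout (`rank = dp * pp_size * tp_size + pp * tp_size + tp`) are illustrative assumptions.

```python
import torch.distributed as dist

def init_3d_parallel(tp_size: int, pp_size: int, dp_size: int):
    """Build TP / PP / DP process groups, assuming the rank layout
    rank = dp * (pp_size * tp_size) + pp * tp_size + tp."""
    dist.init_process_group(backend="nccl")
    rank, world = dist.get_rank(), dist.get_world_size()
    assert world == tp_size * pp_size * dp_size, "world size must equal tp * pp * dp"

    tp_group = pp_group = dp_group = None

    # Tensor-parallel groups: same (dp, pp), all tp ranks -> contiguous blocks.
    for start in range(0, world, tp_size):
        ranks = list(range(start, start + tp_size))
        g = dist.new_group(ranks)  # every rank must create every group, in the same order
        if rank in ranks:
            tp_group = g

    # Pipeline-parallel groups: same (dp, tp), all pp ranks.
    for dp in range(dp_size):
        for tp in range(tp_size):
            ranks = [dp * pp_size * tp_size + pp * tp_size + tp for pp in range(pp_size)]
            g = dist.new_group(ranks)
            if rank in ranks:
                pp_group = g

    # Data-parallel groups: same (pp, tp), all dp ranks.
    for pp in range(pp_size):
        for tp in range(tp_size):
            ranks = [dp * pp_size * tp_size + pp * tp_size + tp for dp in range(dp_size)]
            g = dist.new_group(ranks)
            if rank in ranks:
                dp_group = g

    return tp_group, pp_group, dp_group
```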
To start training, just run `sh train.sh`.
This repo is a work in progress and is intended for study use only.
2023/11/13 -- Model sharding ready, tested GPT2 on 8 GPUs.
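Model sharding here follows the Megatron-LM idea of splitting weight matrices across the tensor-parallel group. The snippet below is a minimal sketch of a column-parallel linear layer, not the repo's implementation; the class name and initialization details are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.distributed as dist

class ColumnParallelLinear(nn.Module):
    """Each TP rank holds a column shard of the weight; an all-gather
    reassembles the full output (forward pass only -- real training code,
    like Megatron-LM, wraps the collectives in custom autograd functions)."""

    def __init__(self, in_features: int, out_features: int, tp_group):
        super().__init__()
        self.tp_group = tp_group
        tp_size = dist.get_world_size(tp_group)
        assert out_features % tp_size == 0
        # Each rank owns out_features / tp_size output columns.
        self.weight = nn.Parameter(torch.empty(out_features // tp_size, in_features))
        nn.init.normal_(self.weight, std=0.02)

    def forward(self, x):
        local_out = x @ self.weight.t()  # [..., out_features / tp_size]
        chunks = [torch.empty_like(local_out)
                  for _ in range(dist.get_world_size(self.tp_group))]
        dist.all_gather(chunks, local_out, group=self.tp_group)
        return torch.cat(chunks, dim=-1)  # [..., out_features]
```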
2023/12/16 -- Pipeline-parallel forward pass ready; tested with GPT2 on a V100 node (6 GPUs) at PP size 6 (GPT2 has 12 blocks, so 2 blocks are placed on each GPU).
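To make the block placement concrete: with pipeline parallelism, each stage owns a contiguous slice of the transformer blocks and passes activations to the next stage with point-to-point communication. The following is a rough sketch (assumed, not the repo's code) that also assumes pure pipeline parallelism, i.e. the stage index equals the global rank, which matches the 6-GPU / PP=6 test above.

```python
import torch.distributed as dist

def blocks_for_stage(num_blocks: int, pp_size: int, stage: int):
    """E.g. 12 GPT2 blocks with pp_size=6 -> stage 0 gets blocks [0, 1],
    stage 1 gets [2, 3], and so on."""
    per_stage = num_blocks // pp_size
    return list(range(stage * per_stage, (stage + 1) * per_stage))

def pipeline_forward(hidden, my_blocks, stage, pp_size):
    # On stages > 0, `hidden` is a pre-allocated buffer of the right shape/dtype
    # that receives activations from the previous stage; stage 0 starts from
    # the embedding output.
    if stage > 0:
        dist.recv(hidden, src=stage - 1)
    for block in my_blocks:
        hidden = block(hidden)[0]  # HF GPT2Block returns a tuple; [0] is hidden_states
    # Hand the activations to the next stage; the last stage keeps the output.
    if stage < pp_size - 1:
        dist.send(hidden, dst=stage + 1)
    return hidden
```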
Logs you should see:
| Model size | TP | PP | DP | Model FLOPs Utilization |
|---|---|---|---|---|
| 124M | | | | |
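For reference, the Model FLOPs Utilization (MFU) column can be estimated with the common approximation of 6 FLOPs per parameter per trained token. The sketch below is a back-of-the-envelope helper, not a measurement; the throughput and peak-FLOPS numbers in the example are placeholders.

```python
def mfu(params: float, tokens_per_sec: float, num_gpus: int, peak_flops_per_gpu: float) -> float:
    """MFU = achieved model FLOPs per second / aggregate peak FLOPs,
    using the ~6 * params FLOPs-per-token approximation for fwd + bwd."""
    model_flops_per_sec = 6 * params * tokens_per_sec
    return model_flops_per_sec / (num_gpus * peak_flops_per_gpu)

# Hypothetical example: GPT2-124M at 100k tokens/s on 8 GPUs with a
# nominal 125 TFLOPS fp16 peak each.
print(f"MFU = {mfu(124e6, 1e5, 8, 125e12):.2%}")
```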