Model partitioning for pipeline parallelism
xrsrke opened this issue · 1 comment
xrsrke commented
Basically, take a transformer model and num_pipeline_stage as arguments, then divide the module like this:
The first stage and the last stage must include the embedding layer and lm_head, respectively.
All other stages in between should be divided evenly.
For example: if we have [embedding layer] > [8 x transformer blocks] > [language model head], and we want to shard them into 5 pipeline stages:
- The first partition includes the embedding layer and the first block.
- The 3 partitions in between each consist of 2 transformer blocks.
- The last partition includes the language model head and the last block.
The goal is to arrange the first and the last pipeline stages so they do not become bottlenecks in terms of training speed, while the blocks in the stages in between are distributed evenly to balance the computation.
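A minimal sketch of this partitioning scheme is below. The helper names (`partition_block_counts`, `partition_transformer`) are hypothetical and not part of any existing API; it assumes at least 3 stages and at least as many transformer blocks as stages, and gives the edge stages fewer blocks to offset the embedding / lm_head cost:

```python
import torch.nn as nn


def partition_block_counts(num_blocks: int, num_stages: int) -> list[int]:
    """How many transformer blocks each stage gets (hypothetical helper).

    First and last stages receive fewer blocks to make room for the
    embedding layer and the language model head; the remaining blocks
    are spread evenly over the middle stages.
    """
    assert num_stages >= 3 and num_blocks >= num_stages
    edge = num_blocks // num_stages                # blocks for first/last stage
    remaining = num_blocks - 2 * edge
    middle_stages = num_stages - 2
    base, extra = divmod(remaining, middle_stages)
    middle = [base + (1 if i < extra else 0) for i in range(middle_stages)]
    return [edge] + middle + [edge]


def partition_transformer(embedding: nn.Module,
                          blocks: list[nn.Module],
                          lm_head: nn.Module,
                          num_stages: int) -> list[nn.Sequential]:
    """Split [embedding] + blocks + [lm_head] into num_stages partitions."""
    counts = partition_block_counts(len(blocks), num_stages)
    stages, idx = [], 0
    for stage_id, n in enumerate(counts):
        modules = list(blocks[idx: idx + n])
        idx += n
        if stage_id == 0:
            modules = [embedding] + modules        # first stage owns the embedding
        if stage_id == num_stages - 1:
            modules = modules + [lm_head]          # last stage owns the lm_head
        stages.append(nn.Sequential(*modules))
    return stages


# For the example above: 8 blocks over 5 stages -> [1, 2, 2, 2, 1]
print(partition_block_counts(num_blocks=8, num_stages=5))
```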
abourramouss commented
PR #28 is a first approach. I am trying to understand how to combine wte and wpe.
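For reference, one way to combine them would be wrapping GPT-2's token (`wte`) and position (`wpe`) embedding tables in a single module that the first pipeline stage owns; GPT-2 sums the two and applies dropout. A sketch under that assumption (the `GPT2Embeddings` class name is made up, not from the PR):

```python
import torch
import torch.nn as nn


class GPT2Embeddings(nn.Module):
    """Hypothetical wrapper combining wte and wpe for the first pipeline stage."""

    def __init__(self, wte: nn.Embedding, wpe: nn.Embedding, dropout: float = 0.1):
        super().__init__()
        self.wte = wte     # token embedding table
        self.wpe = wpe     # position embedding table
        self.drop = nn.Dropout(dropout)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        positions = torch.arange(input_ids.size(-1), device=input_ids.device)
        # GPT-2 adds token and position embeddings, then applies dropout
        return self.drop(self.wte(input_ids) + self.wpe(positions))
```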