Mixture of Experts
xrsrke opened this issue · 0 comments
xrsrke commented
APIs
from pipegoose.nn.expert_parallel import ExpertParallel, ExpertLoss
from pipegoose.distributed import ParallelContext
import torch.nn as nn

parallel_context = ParallelContext.from_torch(expert_parallel_size=8)

mlp = CustomExpert()                # user-defined expert module (e.g. an FFN block)
router = CustomRouter()             # user-defined routing module
noise_policy = CustomNoisePolicy()  # user-defined noise policy applied to the router logits
loss_func = nn.CrossEntropyLoss()

model = ExpertParallel(
    model,                          # a 🤗 transformers model to be converted into an MoE
    expert=mlp,
    router=router,
    noise_policy=noise_policy,
    enable_tensor_parallelism=True,
    parallel_context=parallel_context,
).parallelize()
loss_func = ExpertLoss(loss_func, aux_weight=0.1)
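CustomExpert, CustomRouter, and CustomNoisePolicy are user-defined placeholders; the issue does not pin down their interfaces. A minimal sketch of what they could look like, assuming the router maps hidden states to a top-1 expert index plus gate probabilities (the shapes and return values below are assumptions, not the pipegoose API):

```python
import torch
from torch import nn


class CustomNoisePolicy(nn.Module):
    """Add Gaussian noise to the router logits during training (a common load-balancing trick)."""
    def __init__(self, std: float = 1e-2):
        super().__init__()
        self.std = std

    def forward(self, router_logits: torch.Tensor) -> torch.Tensor:
        if self.training:
            return router_logits + torch.randn_like(router_logits) * self.std
        return router_logits


class CustomRouter(nn.Module):
    """Top-1 router: score each token against every expert and keep the best one."""
    def __init__(self, d_model: int = 768, num_experts: int = 8):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, hidden_states: torch.Tensor):
        # hidden_states: (batch, seq_len, d_model)
        logits = self.gate(hidden_states)          # (batch, seq_len, num_experts)
        probs = logits.softmax(dim=-1)
        top1_probs, top1_idx = probs.max(dim=-1)   # both (batch, seq_len)
        return top1_idx, top1_probs, logits


class CustomExpert(nn.Module):
    """A plain feed-forward block used as one expert."""
    def __init__(self, d_model: int = 768, d_ff: int = 3072):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)
```

Whether ExpertParallel should also scale each expert's output by top1_probs is exactly the open router-probability question in the TODOs below.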
TODOs
- Top-1 and Top-2 router
- ExpertParallel (turn a 🤗 transformers model into an MoE automatically)
  - Does the expert embedding need to be multiplied by its corresponding router probability?
- Make ExpertParallel work with data parallelism (see the backward-hook sketch after this list)
  - Create a new process group for experts across the data parallelism dimension
  - Register a backward hook between the same expert across the data parallelism dimension
- Optionally apply tensor parallelism to an expert layer
- Make ExpertParallel work with pipeline parallelism
- Make ExpertParallel work with ZeRO-1
- Loss function (including the aux loss and z-loss; see the loss sketch after this list)
- Move inputs to the target expert's device
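For the data-parallelism items above, a rough sketch of the backward-hook idea, assuming torch.distributed and a process group that contains the replicas of one expert across the data-parallel dimension (the helper name and plumbing here are hypothetical, not pipegoose API):

```python
import torch
import torch.distributed as dist


def sync_expert_grads_across_dp(expert: torch.nn.Module, dp_expert_group) -> None:
    """Average one expert's gradients across its replicas on the data-parallel dimension.

    Hypothetical helper: `dp_expert_group` is assumed to be the new process group
    created for this expert across data parallelism.
    """
    world_size = dist.get_world_size(group=dp_expert_group)

    def make_hook():
        def hook(grad: torch.Tensor) -> torch.Tensor:
            # Divide first, then sum-reduce, so the result is the average gradient.
            grad = grad / world_size
            dist.all_reduce(grad, op=dist.ReduceOp.SUM, group=dp_expert_group)
            return grad
        return hook

    for param in expert.parameters():
        if param.requires_grad:
            param.register_hook(make_hook())
```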
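For the loss item, a sketch of what ExpertLoss might add on top of the task loss: the auxiliary load-balancing loss from Switch Transformers and the router z-loss from ST-MoE. Only aux_weight appears in the API above, so the z_weight argument and the router-statistics plumbing below are assumptions:

```python
import torch
import torch.nn.functional as F


def load_balancing_loss(router_probs: torch.Tensor, expert_indices: torch.Tensor) -> torch.Tensor:
    """Switch Transformers aux loss:
    num_experts * sum_e (fraction of tokens routed to e) * (mean router prob for e)."""
    num_experts = router_probs.size(-1)
    # Fraction of tokens dispatched to each expert (hard assignment).
    dispatch = F.one_hot(expert_indices, num_experts).float().flatten(0, -2).mean(dim=0)
    # Mean router probability assigned to each expert (soft assignment).
    mean_probs = router_probs.flatten(0, -2).mean(dim=0)
    return num_experts * torch.sum(dispatch * mean_probs)


def router_z_loss(router_logits: torch.Tensor) -> torch.Tensor:
    """ST-MoE z-loss: mean of logsumexp(logits)^2 over tokens, penalizing large router logits."""
    return torch.logsumexp(router_logits, dim=-1).square().mean()


# Hypothetical combined objective, mirroring ExpertLoss(loss_func, aux_weight=0.1):
def moe_loss(task_loss, router_logits, router_probs, expert_indices,
             aux_weight: float = 0.1, z_weight: float = 1e-3):
    aux = load_balancing_loss(router_probs, expert_indices)
    z = router_z_loss(router_logits)
    return task_loss + aux_weight * aux + z_weight * z
```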
Engineering Reading
- Pipeline MoE: A Flexible MoE Implementation with Pipeline Parallelism
- DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
- DeepSpeed-TED: A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training
- MegaBlocks: Efficient Sparse Training with Mixture-of-Experts
- FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models
MoE Reading
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
- GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
- ST-MoE: Designing Stable and Transferable Sparse Expert Models
- Mixture-of-Experts with Expert Choice Routing