xrsrke/pipegoose

Mixture of Experts


xrsrke commented

APIs

import torch.nn as nn

from pipegoose.distributed import ParallelContext
from pipegoose.nn.expert_parallel import ExpertParallel, ExpertLoss

parallel_context = ParallelContext.from_torch(expert_parallel_size=8)

model = ...  # a 🤗 transformers model to be converted into an MoE
mlp = CustomExpert()        # user-defined expert module (see sketch below)
router = CustomRouter()     # user-defined routing module
noise_policy = CustomNoisePolicy()
loss_func = nn.CrossEntropyLoss()

model = ExpertParallel(
    model,
    expert=mlp,
    router=router,
    noise_policy=noise_policy,
    enable_tensor_parallelism=True,
    parallel_context=parallel_context,
).parallelize()

loss_func = ExpertLoss(loss_func, aux_weight=0.1)
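
For reference, a minimal sketch of what the custom components above could look like. Only the class names come from the snippet; the internals (a standard transformer MLP expert, a linear gate, and Switch-Transformers-style multiplicative jitter) are illustrative assumptions, not pipegoose's actual implementation.

import torch
import torch.nn.functional as F
from torch import nn

D_MODEL = 768       # hidden size, assumed for illustration
NUM_EXPERTS = 8

class CustomExpert(nn.Module):
    # Each expert is a standard transformer MLP block.
    def __init__(self, d_model=D_MODEL):
        super().__init__()
        self.w_in = nn.Linear(d_model, 4 * d_model)
        self.w_out = nn.Linear(4 * d_model, d_model)

    def forward(self, x):
        return self.w_out(F.gelu(self.w_in(x)))

class CustomRouter(nn.Module):
    # Maps each token embedding to one unnormalized score per expert.
    def __init__(self, d_model=D_MODEL, num_experts=NUM_EXPERTS):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts, bias=False)

    def forward(self, x):
        return self.gate(x)

class CustomNoisePolicy(nn.Module):
    # Multiplicative jitter on the router input during training,
    # as in Switch Transformers, to encourage exploration.
    def __init__(self, eps=1e-2):
        super().__init__()
        self.eps = eps

    def forward(self, x):
        if self.training:
            x = x * torch.empty_like(x).uniform_(1 - self.eps, 1 + self.eps)
        return x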

TODOs

  • Top-1, Top-2 router (see the routing sketch after this list)

  • ExpertParallel (turn a 🤗 transformers model into an MoE automatically)

  • Does an expert's output need to be multiplied by its corresponding router probability? (see the routing sketch after this list)

  • Make ExpertParallel work with data parallelism

    • Create a new process group for each expert across the data parallelism dimension
    • Register a backward hook that all-reduces gradients of the same expert across the data parallelism dimension (see the gradient-sync sketch after this list)

  • Optionally apply tensor parallelism to an expert layer

  • Make ExpertParallel work with pipeline parallelism

  • Make ExpertParallel work with ZeRO-1

  • Loss function (including the auxiliary load-balancing loss and the router z-loss; see the loss sketch after this list)

  • Move inputs to the target expert's device
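
A minimal top-2 routing sketch for the router and output-weighting items above (tensor shapes and function names are assumptions, not pipegoose code). In Switch Transformers and ST-MoE, each expert's output is scaled by its router probability, and that scaling is what keeps the routing decision differentiable, which suggests the answer to the weighting question is yes.

import torch
import torch.nn.functional as F

def top2_route(logits):
    # logits: (num_tokens, num_experts) raw gate scores from the router.
    probs = F.softmax(logits, dim=-1)
    topk_probs, topk_indices = probs.topk(k=2, dim=-1)
    return topk_probs, topk_indices

def combine_expert_outputs(x, experts, topk_probs, topk_indices):
    # x: (num_tokens, d_model). Each token's output is the sum of its
    # chosen experts' outputs, weighted by the router probabilities.
    out = torch.zeros_like(x)
    for slot in range(topk_indices.size(-1)):
        for expert_id, expert in enumerate(experts):
            mask = topk_indices[:, slot] == expert_id
            if mask.any():
                out[mask] += topk_probs[mask, slot, None] * expert(x[mask])
    return out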
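For the data-parallelism item, a sketch of the backward-hook idea: replicas of the same expert live on different data-parallel ranks, so their gradients must be averaged in a dedicated process group rather than in the default data-parallel all-reduce. The helper below is hypothetical; pipegoose's real API may differ.

import torch.distributed as dist

def register_expert_grad_sync(expert, expert_dp_ranks):
    # Process group containing the ranks that hold replicas of this
    # same expert across the data-parallel dimension.
    group = dist.new_group(ranks=expert_dp_ranks)
    world_size = len(expert_dp_ranks)

    def all_reduce_grad(grad):
        # Average this expert's gradient over its replicas only.
        grad = grad / world_size
        dist.all_reduce(grad, op=dist.ReduceOp.SUM, group=group)
        return grad

    for param in expert.parameters():
        param.register_hook(all_reduce_grad)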
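And for the loss item, a sketch of what ExpertLoss could compute internally: the task loss plus the Switch Transformers load-balancing auxiliary loss and the ST-MoE router z-loss. The formulas come from those papers; the function signature is an assumption.

import torch
import torch.nn.functional as F

def load_balancing_loss(router_probs, expert_indices, num_experts):
    # Switch Transformers aux loss: num_experts * sum_i f_i * P_i, where
    # f_i = fraction of tokens dispatched to expert i (top-1 choice) and
    # P_i = mean router probability assigned to expert i.
    one_hot = F.one_hot(expert_indices, num_experts).float()
    f = one_hot.mean(dim=0)
    P = router_probs.mean(dim=0)
    return num_experts * torch.sum(f * P)

def router_z_loss(router_logits):
    # ST-MoE z-loss: penalizes large gate logits for numerical stability.
    return torch.logsumexp(router_logits, dim=-1).pow(2).mean()

def expert_loss(task_loss, router_logits, expert_indices,
                aux_weight=0.1, z_weight=1e-3):
    probs = F.softmax(router_logits, dim=-1)
    num_experts = router_logits.size(-1)
    aux = load_balancing_loss(probs, expert_indices, num_experts)
    return task_loss + aux_weight * aux + z_weight * router_z_loss(router_logits)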

Engineering Reading

  • Pipeline MoE: A Flexible MoE Implementation with Pipeline Parallelism
  • DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
  • DeepSpeed-TED: A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training
  • MegaBlocks: Efficient Sparse Training with Mixture-of-Experts
  • FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models

MoE Reading

  • Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
  • GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
  • ST-MoE: Designing Stable and Transferable Sparse Expert Models
  • Mixture-of-Experts with Expert Choice Routing