xrsrke/pipegoose

Mixture of Experts


xrsrke commented

APIs

import torch.nn as nn

from pipegoose.distributed import ParallelContext
from pipegoose.nn.expert_parallel import ExpertParallel, ExpertLoss

parallel_context = ParallelContext.from_torch(expert_parallel_size=8)

model = ...  # a 🤗 transformers model to be converted into an MoE
mlp = CustomExpert()        # user-defined expert module (see sketch below)
router = CustomRouter()     # user-defined routing module
noise_policy = CustomNoisePolicy()
loss_func = nn.CrossEntropyLoss()

model = ExpertParallel(
    model,
    expert=mlp,
    router=router,
    noise_policy=noise_policy,
    enable_tensor_parallelism=True,
    parallel_context=parallel_context,
).parallelize()

loss_func = ExpertLoss(loss_func, aux_weight=0.1)
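
For reference, a minimal sketch of what the custom components above could look like. Only the class names come from the snippet; the internals (a standard transformer MLP expert, a linear gate, and Switch-Transformers-style multiplicative jitter) are illustrative assumptions, not pipegoose's actual implementation.

import torch
import torch.nn.functional as F
from torch import nn

D_MODEL = 768       # hidden size, assumed for illustration
NUM_EXPERTS = 8

class CustomExpert(nn.Module):
    # Each expert is a standard transformer MLP block.
    def __init__(self, d_model=D_MODEL):
        super().__init__()
        self.w_in = nn.Linear(d_model, 4 * d_model)
        self.w_out = nn.Linear(4 * d_model, d_model)

    def forward(self, x):
        return self.w_out(F.gelu(self.w_in(x)))

class CustomRouter(nn.Module):
    # Maps each token embedding to one unnormalized score per expert.
    def __init__(self, d_model=D_MODEL, num_experts=NUM_EXPERTS):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts, bias=False)

    def forward(self, x):
        return self.gate(x)

class CustomNoisePolicy(nn.Module):
    # Multiplicative jitter on the router input during training,
    # as in Switch Transformers, to encourage exploration.
    def __init__(self, eps=1e-2):
        super().__init__()
        self.eps = eps

    def forward(self, x):
        if self.training:
            x = x * torch.empty_like(x).uniform_(1 - self.eps, 1 + self.eps)
        return x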

TODOs

  • Top-1, Top-2 router (see the routing sketch after this list)

  • ExpertParallel (turn a 🤗 transformers model into an MoE automatically)

  • Does an expert's output need to be multiplied by its corresponding router probability? (see the routing sketch after this list)

  • Make ExpertParallel work with data parallelism

    • Create a new process group for each expert across the data parallelism dimension
    • Register a backward hook that all-reduces gradients of the same expert across the data parallelism dimension (see the gradient-sync sketch after this list)

  • Optionally apply tensor parallelism to an expert layer

  • Make ExpertParallel work with pipeline parallelism

  • Make ExpertParallel work with ZeRO-1

  • Loss function (including the auxiliary load-balancing loss and the router z-loss; see the loss sketch after this list)

  • Move inputs to the target expert's device
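
A minimal top-2 routing sketch for the router and output-weighting items above (tensor shapes and function names are assumptions, not pipegoose code). In Switch Transformers and ST-MoE, each expert's output is scaled by its router probability, and that scaling is what keeps the routing decision differentiable, which suggests the answer to the weighting question is yes.

import torch
import torch.nn.functional as F

def top2_route(logits):
    # logits: (num_tokens, num_experts) raw gate scores from the router.
    probs = F.softmax(logits, dim=-1)
    topk_probs, topk_indices = probs.topk(k=2, dim=-1)
    return topk_probs, topk_indices

def combine_expert_outputs(x, experts, topk_probs, topk_indices):
    # x: (num_tokens, d_model). Each token's output is the sum of its
    # chosen experts' outputs, weighted by the router probabilities.
    out = torch.zeros_like(x)
    for slot in range(topk_indices.size(-1)):
        for expert_id, expert in enumerate(experts):
            mask = topk_indices[:, slot] == expert_id
            if mask.any():
                out[mask] += topk_probs[mask, slot, None] * expert(x[mask])
    return out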
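For the data-parallelism item, a sketch of the backward-hook idea: replicas of the same expert live on different data-parallel ranks, so their gradients must be averaged in a dedicated process group rather than in the default data-parallel all-reduce. The helper below is hypothetical; pipegoose's real API may differ.

import torch.distributed as dist

def register_expert_grad_sync(expert, expert_dp_ranks):
    # Process group containing the ranks that hold replicas of this
    # same expert across the data-parallel dimension.
    group = dist.new_group(ranks=expert_dp_ranks)
    world_size = len(expert_dp_ranks)

    def all_reduce_grad(grad):
        # Average this expert's gradient over its replicas only.
        grad = grad / world_size
        dist.all_reduce(grad, op=dist.ReduceOp.SUM, group=group)
        return grad

    for param in expert.parameters():
        param.register_hook(all_reduce_grad)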
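And for the loss item, a sketch of what ExpertLoss could compute internally: the task loss plus the Switch Transformers load-balancing auxiliary loss and the ST-MoE router z-loss. The formulas come from those papers; the function signature is an assumption.

import torch
import torch.nn.functional as F

def load_balancing_loss(router_probs, expert_indices, num_experts):
    # Switch Transformers aux loss: num_experts * sum_i f_i * P_i, where
    # f_i = fraction of tokens dispatched to expert i (top-1 choice) and
    # P_i = mean router probability assigned to expert i.
    one_hot = F.one_hot(expert_indices, num_experts).float()
    f = one_hot.mean(dim=0)
    P = router_probs.mean(dim=0)
    return num_experts * torch.sum(f * P)

def router_z_loss(router_logits):
    # ST-MoE z-loss: penalizes large gate logits for numerical stability.
    return torch.logsumexp(router_logits, dim=-1).pow(2).mean()

def expert_loss(task_loss, router_logits, expert_indices,
                aux_weight=0.1, z_weight=1e-3):
    probs = F.softmax(router_logits, dim=-1)
    num_experts = router_logits.size(-1)
    aux = load_balancing_loss(probs, expert_indices, num_experts)
    return task_loss + aux_weight * aux + z_weight * router_z_loss(router_logits)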

Engineering Reading

  • Pipeline MoE: A Flexible MoE Implementation with Pipeline Parallelism
  • DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
  • DeepSpeed-TED: A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training
  • MegaBlocks: Efficient Sparse Training with Mixture-of-Experts
  • FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models

MoE Reading

  • Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
  • GLaM: Efficient Scaling of Language Models with Mixture-of-Experts
  • ST-MoE: Designing Stable and Transferable Sparse Expert Models
  • Mixture-of-Experts with Expert Choice Routing