Awesome_MOE

🚀 A collection of MoE (Mixture-of-Experts) papers, code, tools, etc.

Paper List

  • [2024/03] Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM [paper]
  • [2024/01] Chinese-Mixtral-8x7B [github]
  • [2024/01] MoLE: Mixture of LoRA Experts [paper]
  • [2024/01] Sparse MoE with Language Guided Routing for Multilingual Machine Translation [paper][code]
  • [2024/01] Scalable Modular Network: A Framework for Adaptive Learning via Agreement Routing [paper]
  • [2023/12] Mixtral 8x7B [blog][paper][code]
  • [2023/10] Merge, Then Compress: Demystify Efficient SMoE with Hints from Its Routing Policy [paper]
  • [2023/10] Emergent Mixture-of-Experts: Can Dense Pre-trained Transformers Benefit from Emergent Modular Structures? [paper]
  • [2023/09] Large Language Model Routing with Benchmark Datasets [paper]
  • [2023/07] llama-moe: Building Mixture-of-Experts from LLaMA with Continual Pre-training [paper][code]
  • [2023/05] Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models [paper]
  • [2023/03] PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing. arXiv 2023. [paper]
  • [2022] Eliciting and Understanding Cross-Task Skills with Task-Level Mixture-of-Experts. Findings of EMNLP 2022. [paper]
  • [2022/04] Sparsely Activated Mixture-of-Experts are Robust Multi-Task Learners [paper]
  • [2022/01] GLaM: Efficient Scaling of Language Models with Mixture-of-Experts. ICML 2022. [paper]
  • [2022/01] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity [paper]
  • [2022] Spatial Mixture-of-Experts [paper]
  • [2021/08] DEMix Layers: Disentangling Domains for Modular Language Modeling [paper]
  • [2021] Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference. Findings of EMNLP 2021. [paper]
  • [2021] Scaling Vision with Sparse Mixture of Experts [paper]
  • [2020/06] GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding [paper]
  • [2018/01] Practical and Theoretical Aspects of Mixture-of-Experts Modeling: An Overview
  • [2017/01] Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer [paper]
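
Most of the papers above build on the same core mechanism: a learned router sends each token to a small subset (top-k) of expert feed-forward networks, as introduced in the Sparsely-Gated Mixture-of-Experts Layer paper and simplified in Switch Transformers. The sketch below is a minimal, illustrative PyTorch version of that top-k routing, not code from any repository listed here; the class and parameter names (SparseMoE, n_experts, top_k, d_hidden) are assumptions chosen for the example.

```python
# Illustrative sketch of a sparsely-gated top-k MoE layer (names are assumptions,
# not taken from any of the codebases referenced above).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router: one logit per expert for each token.
        self.router = nn.Linear(d_model, n_experts)
        # Each expert is an independent feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> flatten tokens for routing.
        tokens = x.reshape(-1, x.shape[-1])
        logits = self.router(tokens)                        # (n_tokens, n_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)  # keep the k best experts per token
        weights = F.softmax(weights, dim=-1)                # renormalize over the selected experts
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            token_idx, slot = (indices == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue  # this expert received no tokens in the batch
            out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert(tokens[token_idx])
        return out.reshape_as(x)
```

Production systems typically add an auxiliary load-balancing loss over the router outputs (as in GShard and Switch Transformers) so that tokens are spread roughly evenly across experts.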