awesome-mixture-of-experts

MIT License

A collection of AWESOME things about mixture-of-experts

This repo is a collection of AWESOME things about mixture-of-experts, including papers, code, etc. Feel free to star and fork.

Open Models

  • DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models [Jan 2024] Repo Paper
  • LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training [Dec 2023] Repo
  • Mixtral of Experts [Dec 2023] Repo Paper
  • OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models [Aug 2023] Repo Paper
  • Efficient Large Scale Language Modeling with Mixtures of Experts [Dec 2021] Repo Paper
  • Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity [Feb 2021] Repo Paper

Papers

Must Read

These are my favorite MoE papers. I think they can greatly help new MoErs get up to speed on the topic.

  • A Review of Sparse Expert Models in Deep Learning [4 Sep 2022]
  • Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity [11 Jan 2021]
  • GLaM: Efficient Scaling of Language Models with Mixture-of-Experts [13 Dec 2021]
  • Scaling Vision with Sparse Mixture of Experts [NeurIPS2021]
  • ST-MoE: Designing Stable and Transferable Sparse Expert Models [17 Feb 2022]
  • Mixture-of-Experts with Expert Choice Routing [NeurIPS 2022]
  • Brainformers: Trading Simplicity for Efficiency [ICML 2023]
  • From Sparse to Soft Mixtures of Experts [2 Aug 2023]
  • OpenMoE: An Early Effort on Open Mixture-of-Experts Language Models [Aug 2023]

MoE Model

Publication

  • Patch-level Routing in Mixture-of-Experts is Provably Sample-efficient for Convolutional Neural Networks [ICML 2023]
  • Robust Mixture-of-Expert Training for Convolutional Neural Networks [ICCV 2023]
  • Merging Experts into One: Improving Computational Efficiency of Mixture of Experts [EMNLP 2023]
  • PAD-Net: An Efficient Framework for Dynamic Networks [ACL 2023]
  • Brainformers: Trading Simplicity for Efficiency [ICML 2023]
  • On the Representation Collapse of Sparse Mixture of Experts [NeurIPS 2022]
  • StableMoE: Stable Routing Strategy for Mixture of Experts [ACL 2022]
  • Taming Sparsely Activated Transformer with Stochastic Experts [ICLR 2022]
  • Go Wider Instead of Deeper [AAAI2022]
  • Hash Layers For Large Sparse Models [NeurIPS2021]
  • DSelect-k: Differentiable Selection in the Mixture of Experts with Applications to Multi-Task Learning [NeurIPS2021]
  • Scaling Vision with Sparse Mixture of Experts [NeurIPS2021]
  • BASE Layers: Simplifying Training of Large, Sparse Models [ICML2021]
  • Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer [ICLR2017]
  • CPM-2: Large-scale Cost-effective Pre-trained Language Models [AI Open]
  • Mixture of experts: a literature survey [Artificial Intelligence Review]

arXiv

  • MoEC: Mixture of Expert Clusters [19 Jul 2022]
  • No Language Left Behind: Scaling Human-Centered Machine Translation [6 Jul 2022]
  • Sparse Fusion Mixture-of-Experts are Domain Generalizable Learners [8 Jun 2022]
  • Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts [6 Jun 2022]
  • Patcher: Patch Transformers with Mixture of Experts for Precise Medical Image Segmentation [5 Jun 2022]
  • Interpretable Mixture of Experts for Structured Data [5 Jun 2022]
  • Task-Specific Expert Pruning for Sparse Mixture-of-Experts [1 Jun 2022]
  • Gating Dropout: Communication-efficient Regularization for Sparsely Activated Transformers [28 May 2022]
  • AdaMix: Mixture-of-Adapter for Parameter-efficient Tuning of Large Language Models [24 May 2022]
  • Sparse Mixers: Combining MoE and Mixing to build a more efficient BERT [24 May 2022]
  • One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code [12 May 2022]
  • SkillNet-NLG: General-Purpose Natural Language Generation with a Sparsely Activated Approach [26 Apr 2022]
  • Residual Mixture of Experts [20 Apr 2022]
  • Sparsely Activated Mixture-of-Experts are Robust Multi-Task Learners [16 Apr 2022]
  • MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [15 Apr 2022]
  • Mixture-of-experts VAEs can disregard variation in surjective multimodal data [11 Apr 2022]
  • Efficient Language Modeling with Sparse all-MLP [14 Mar 2022]
  • Parameter-Efficient Mixture-of-Experts Architecture for Pre-trained Language Models [2 Mar 2022]
  • Mixture-of-Experts with Expert Choice Routing [18 Feb 2022]
  • ST-MoE: Designing Stable and Transferable Sparse Expert Models [17 Feb 2022]
  • Designing Effective Sparse Expert Models [17 Feb 2022]
  • Unified Scaling Laws for Routed Language Models [2 Feb 2022]
  • Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model [28 Jan 2022]
  • One Student Knows All Experts Know: From Sparse to Dense [26 Jan 2022]
  • Dense-to-Sparse Gate for Mixture-of-Experts [29 Dec 2021]
  • Efficient Large Scale Language Modeling with Mixtures of Experts [20 Dec 2021]
  • GLaM: Efficient Scaling of Language Models with Mixture-of-Experts [13 Dec 2021]
  • Building a great multi-lingual teacher with sparsely-gated mixture of experts for speech recognition [10 Dec 2021]
  • SpeechMoE2: Mixture-of-Experts Model with Improved Routing [23 Nov 2021]
  • VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts [23 Nov 2021]
  • Towards More Effective and Economic Sparsely-Activated Model [14 Oct 2021]
  • M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining [8 Oct 2021]
  • Sparse MoEs meet Efficient Ensembles [7 Oct 2021]
  • MoEfication: Conditional Computation of Transformer Models for Efficient Inference [5 Oct 2021]
  • Cross-token Modeling with Conditional Computation [5 Sep 2021]
  • M6-T: Exploring Sparse Expert Models and Beyond [31 May 2021]
  • SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts [7 May 2021]
  • Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity [11 Jan 2021]
  • Exploring Routing Strategies for Multilingual Mixture-of-Experts Models [28 Sep 2020]

MoE System

Publication

  • Pathways: Asynchronous Distributed Dataflow for ML [MLSys2022]
  • Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning [OSDI2022]
  • FasterMoE: modeling and optimizing training of large-scale dynamic pre-trained models [PPoPP2022]
  • BaGuaLu: Targeting Brain Scale Pretrained Models with over 37 Million Cores [PPoPP2022]
  • GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding [ICLR2021]

arXiv

  • MegaBlocks: Efficient Sparse Training with Mixture-of-Experts [29 Nov 2022]
  • HetuMoE: An Efficient Trillion-scale Mixture-of-Expert Distributed Training System [28 Mar 2022]
  • SE-MoE: A Scalable and Efficient Mixture-of-Experts Distributed Training and Inference System [20 Mar 2022]
  • DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale [14 Jan 2022]
  • SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient [29 Sep 2021]
  • FastMoE: A Fast Mixture-of-Expert Training System [24 Mar 2021]

MoE Application

Publication

  • Switch-NeRF: Learning Scene Decomposition with Mixture of Experts for Large-scale Neural Radiance Fields [2 Feb 2023]

arXiv

  • Spatial Mixture-of-Experts [24 Nov 2022]
  • A Mixture-of-Expert Approach to RL-based Dialogue Management [31 May 2022]
  • Pluralistic Image Completion with Probabilistic Mixture-of-Experts [18 May 2022]
  • ST-ExpertNet: A Deep Expert Framework for Traffic Prediction [5 May 2022]
  • Build a Robust QA System with Transformer-based Mixture of Experts [20 Mar 2022]
  • Mixture of Experts for Biomedical Question Answering [15 Apr 2022]

Library