# MoE Paper Experimental Setups

Author: Adam G
This repository is a collection of experimental setups from notable MoE papers. Note that some entries may be incomplete or erroneous for certain metrics; if so, feel free to raise an issue and I will amend it as soon as possible.
Major tasks examined across these papers:
- Machine Translation (MT) - evaluated mainly on datasets such as WMT (English to French), with BLEU as the metric (see the sketch after this list)
- Masked Language Modelling (MLM)
- Language Modelling (LM)
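
The MT rows below report BLEU; as a point of reference, here is a minimal sketch of corpus-level BLEU scoring with the sacrebleu package. The sentences are hypothetical, and the papers differ in the exact BLEU variant and tooling they used, so treat this as illustrative only.

```python
# Minimal BLEU scoring sketch using the sacrebleu package.
# Hypothetical sentences; papers differ in tokenization and
# BLEU variant, so this is illustrative only.
import sacrebleu

hypotheses = ["the cat sat on the mat"]            # system translations
references = [["the cat is sitting on the mat"]]   # one inner list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```
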
Paper | Year | Expert Size | Total Size | Num Exp (per layer) | Num Layers |
---|---|---|---|---|---|
Megablocks | 11/2022 | N/A | 839M-13B | 64 | 3/6/12 |
Deepspeed-MoE | 01/2022 | 1.3/2.4/8/24/47B | 52/107/349/1064.9/2024B | 128 | 24/16/30/40/58 |
Expert Choice Routing | 02/2022 | 0.145/9.8B | 1.9/143B | 64 | 16 |
Task-Level MoE | 09/2022 | 4096 FFN Size | 533M/13B | 32/128 | 11 |
Hash Layers (vs Switch) | 06/2021 | 4096 FFN Size | 751M/852M/1.28B | 64/16/128 | 1/5/1 |
Hash Layers (vs BASE) | 06/2021 | 100M/33M | 4.5B | 32/3x32 | 1/3 |
GShard | 06/2020 | 8192 FFN Size | 37/150/600B | 128/512/2048 | 12/36 (for each num exp) |
FasterMoE | 03/2022 | 1024/2048/4096 FFN Size | 13.1/13.7/27.4B | 64/16/16 | 12/12/24 |
ST-MoE | 02/2022 | 2816/20480 FFN Size | 4.1/269B | 32/64 | 6/6 (every 4th layer) |
Random Routing | 09/2022 | | 20M-200M | 8/16 | 4/12 |
Gating Dropout | 05/2022 | | 5.6/10B | 128/64 | 12/24 |
BASE Layers | 03/2021 | 135/335/911M | 1.5/44/117B | 128? | 1 (BASE Layer) |
Switch Transformer | 01/2021 | 768/1024/4096 FFN Size | 7/26/395/1571B | 128/128/64/2048 | 12/24/24/15 (Every other) |
Evo MoE | 12/2021 | 335M (MT/MLM/LM) | 1.5B (MT)/1.8B (MLM, LM) | 4 (MT)/16 (MLM, LM) | 6 (MT)/12 (MLM, LM) |
Stable-MoE (LM) | 04/2022 | 3072/4096 FFN Size | 454M/3.22B | 32/64 | 1/1 |
Stable-MoE (MT) | 04/2022 | 2048 FFN Size | 480M | 32 | 2 |
Outrageously Large MoEs (LM) | 01/2017 | 1M (dims=1024x512) | 0.8/0.9/1.1/1.1/1.9/5.1B | 4/32/256/256/1024/4096 | 1 |
Outrageously Large MoEs (LM-Large) | 01/2017 | 1M | 0.1/0.4/1.2/4.4/17.3/68.9/137.7B | 32 & 256/1024/4096/16384/65536/131072-h | 1 |
Outrageously Large MoEs (MT) | 01/2017 | 2M | 8.7B | 32 & 512/2048-h | 2 (one between stacked encoder and decoder) |
Outrageously Large MoEs (MTMT) | 01/2017 | 8192 FFN Size | 8.7B | 512 | 2 |
NLLB | 07/2022 | 8192 FFN Size/33.6M | 54.5B total (51.6B in experts) | 128 | 6 Exp Layers |
Memory Efficient NLLB | 12/2022 | 8192 FFN Size/33.6M | ~10.32B assuming 80% pruning | ~24 per layer, 288 overall | 6 Exp Layers |
GLaM | 12/2021 | 8192 & 16384 & 32768 FFN Size | 20/27/53 & 105/143B & 1.2T | 32/64/128 & 256/64 & 64 | 24 & 32 & 64 (every other layer) |
Amazon SageMaker | |||||
M6-T Sparse Experts | 05/2021 | 1024x4096 & 1024x21248 | 1.4 & 10.8 & 103.2 & 1002.7B | 32 & 128 & 512 & 960 (Total) | 5 & 10 & 24 & 24 |
\* = values that are unconfirmed or inferred from their experiments.
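
As a rough sanity check on the Expert Size and Total Size columns above, total parameters grow as the shared dense parameters plus (MoE layers x experts per layer x expert size). A back-of-the-envelope sketch with illustrative numbers, not taken from any particular row:

```python
# Back-of-the-envelope MoE parameter count. All numbers here are
# illustrative assumptions, not values from any specific paper.
def approx_total_params(dense_params: float, expert_params: float,
                        num_experts: int, num_moe_layers: int) -> float:
    """Shared dense parameters plus every expert FFN across the MoE layers."""
    return dense_params + num_moe_layers * num_experts * expert_params

# e.g. a ~1B dense backbone, 64 experts of ~33M params, 6 MoE layers:
total = approx_total_params(1e9, 33e6, 64, 6)
print(f"~{total / 1e9:.1f}B total parameters")  # ~13.7B
```
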
For hardware requirements, slashes denote different configurations.
Paper | Baseline | Hardware Requirements | Memory | Top-K | Capacity |
---|---|---|---|---|---|
Megablocks | Transformer-Base to GPT3-XL (46M to 1.3B) | 8x A100 80GB | | 1 | 1/1.5/2x |
Deepspeed-MoE | Scalable MoE | 128x A100 80GB | | 2* | 2 |
Expert Choice Routing | GShard | 512x TPU V4 | | N/A* | 2* |
Task-Level MoE | Transformer-Base (142M)/Token/Sentence MoE | 32x TPU V3 | | 1 | |
Hash Layers (vs Switch) | Transformer-Base (225/755M)/Switch Transformer | 8x 32GB V100 | | 1* | |
Hash Layers (vs BASE) | BASE Layers | 16x 32GB V100 | | 1* | |
GShard | GPipe/Base Transformer | 128/512/2048x TPU V3 | | 2 | 2 |
FasterMoE | FastMoE/GShard/BASE | 16/64x V100 | | 2 | |
ST-MoE | Dense-L/T5-XXL/Switch-XXL | TPU | | 2 | 1.25 cap factor |
Random Routing | Thor/Transformer Dense | 8x V100 | | 1/2/4/8/16 | |
Gating Dropout | Scalable MoE | 16/64x V100/A100 | | 1 | 1/2 (train/test) |
BASE Layers | SMoE and Switch (52B) | 8/32/128x 32GB V100 | | | |
Switch Transformer | T5 (223M Base/739M Large) | 32x TPU V3 | | 1 | |
Evo MoE | Switch/Hash Layers/BASE/StableMoE | 8x A100 | | 1 | |
Stable-MoE (LM) | Switch Transformer/BASE Layer/Hash Layer/Transformer-Base | ?x V100 GPUs | | 1 | 1 (from Switch) |
Stable-MoE (MT) | Transformer-Base and Large/BASE Layer/Hash Layer/Switch | ?x V100 GPUs | | 1 | 1 |
Outrageously Large MoEs (LM) | MoE-1 Wide & Deep/4xLSTM-512/LSTM-2048 & 8192 | 4-16x K40s | | 4 or 2 for MoE-h | |
Outrageously Large MoEs (LM-Large) | MoE-1 Wide & Deep/4xLSTM-512/LSTM-2048 & 8192 | 32/64/128x K40s | | 4 or 2 for MoE-h | |
Outrageously Large MoEs (MT) | GNMT/PBMT/LSTM-6/DeepAtt | 64x K40s | | 4 or 2 for MoE-h | |
Outrageously Large MoEs (MTMT) | GNMT-Mono/GNMT-Multi | 64x K40s | | 2 | |
NLLB | | each GPU holds one expert | 101.6GiB | | |
Memory Efficient NLLB | 3.3B NLLB-Dense/NLLB-200 54.5B | 1/4x V100 GPUs | |||
GLaM | Switch/GPT-3/KG-FiD/Megatron-NLG | 1024x TPU V4 (largest) | For the largest model, experts do not fit on a single TPU | 2 | 2* |
Amazon SageMaker | |||||
M6-T Sparse Experts | Their own comparisons with different Top-K | 480x V100 32GB | | | |
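
For context on the Top-K and Capacity columns: the router sends each token to its k highest-scoring experts, and each expert accepts at most roughly capacity_factor x num_tokens x k / num_experts tokens, with overflowing tokens dropped or re-routed (details vary by paper). A minimal generic sketch of that bookkeeping, not any one paper's implementation:

```python
# Generic top-k routing with a capacity factor. A simplified sketch
# of the common scheme, not the implementation of any paper above.
import numpy as np

def route(logits: np.ndarray, k: int, capacity_factor: float):
    """logits: [num_tokens, num_experts] router scores.
    Returns, per expert, the token indices it will process."""
    num_tokens, num_experts = logits.shape
    capacity = int(capacity_factor * num_tokens * k / num_experts)
    topk = np.argsort(-logits, axis=1)[:, :k]   # each token's top-k experts
    assignments = [[] for _ in range(num_experts)]
    for token, experts in enumerate(topk):
        for expert in experts:
            if len(assignments[expert]) < capacity:
                assignments[expert].append(token)
            # else: the token overflows this expert and is dropped
    return assignments

rng = np.random.default_rng(0)
print(route(rng.standard_normal((16, 4)), k=2, capacity_factor=1.25))
```
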
The higher citation count between Google Scholar and Semantic Scholar is taken.
Paper | Dataset | Batch Size | Open Source | Citations | Notes |
---|---|---|---|---|---|
Megablocks | The Pile | 512 | N | 0 | |
Deepspeed-MoE | Lambada/PIQA/BoolQ/RACE-h/Trivia-QA/WebQS | 256/512 | Y | 15/36 | |
Expert Choice Routing | GLaM | N/A | N | 6 | |
Task-Level MoE | WMT | N/A | N | 13 | |
Hash Layers (vs Switch) | Pushshift.io/RoBERTa/Wikitext-103/BST | 40 | Y (partly) | 43 | |
Hash Layers (vs BASE) | Pushshift.io/RoBERTa/Wikitext-103/BST | 2 | Y (partly) | 43 | |
GShard | Custom Dataset | 4M | Y (TPU Only) | 305 | |
FasterMoE | WikiText | | Y | 22 | |
ST-MoE | C4 1.5T | 1M | Y | 26 | |
Random Routing | enwik8/Bookcorpus | 128/256 | Under Review | Under Review | |
Gating Dropout | WMT/Web-50 | 435K | N | 1/5 | |
BASE Layers | RoBERTa corpus and CC100 | | Y | 64/79 | |
Switch Transformer | Large C4 Corpus (180B) | 1M | Y | 525 | |
Evo MoE | WMT(MT)/OpenWebText(LM MLM)/Wikipedia/OpenWebText | N/A | Y | 11 | |
Stable-MoE (LM) | RoBERTa and CC100 | 512K | Y | 9 | |
Stable-MoE (MT) | WMT | 512K | Y | 9 | |
Outrageously Large MoEs (LM) | 1B Word Benchmark | ? | N (but has been recreated) | 1117/1050 | Uses an MoE layer between two LSTMs. 8.4/37.8/272.9/1079/4303M. |
Outrageously Large MoEs (LM-Large) | 100 Billion Google Corpus | 2.5M | "" | "" | Fits up to 1 billion parameters per GPU. The 64 and 128 GPU tests are for the last two expert models. |
Outrageously Large MoEs (MT) | WMT | ? | "" | "" | Fits up to 1 billion parameters per GPU. |
Outrageously Large MoEs (MTMT) | CoRR | 1M (16K per GPU) | "" | "" | |
NLLB | Flores-200(Eval)/LID curated data/Paracrawl and CommonCrawl (Monolingual) | 16K | Y | 26/49 | Every fourth layer is an MoE layer. |
Memory Efficient NLLB | Flores-200 (Eval) | 16K | N | 0 | Releases some results, such as which experts are pruned. Every fourth FFN sublayer is replaced with an MoE layer. NLLB-200 requires 4x 32GB V100s to run. This row uses the 80% pruned model. |
GLaM | Custom GLaM dataset of webpages/Wikipedia/forums, etc. | 1M | N | 59/84 | |
Amazon SageMaker | |||||
M6-T Sparse Experts | | | | | |