OpenMoE is a project aimed at igniting the open-source MoE community! We are releasing a family of open-source Mixture-of-Experts (MoE) large language models.
Since we are a small team working on a huge project, we cannot handle everything ourselves. Instead, we release some intermediate checkpoints in this repo to invite more contributors to work on open-source MoE together!
[2023/08] 🔥 We released an intermediate OpenMoE-8B checkpoint (OpenMoE-v0.2) along with two other models. Check out the blog post.
- PyTorch Implementation with Colossal AI
- More Evaluation
- Continue Training to 1T tokens
- Paper
Currently, three models have been released in total.
Model Name | Description | #Param | GCS | Huggingface | Gin File |
---|---|---|---|---|---|
OpenMoE-base/16E | A small MoE model for debugging | 637M | gs://openmoe/openmoe-base/checkpoint_500000 | Link | Link |
OpenLLaMA-base | A dense counterpart of OpenMoE-base | 310M | gs://openmoe/openllama-base/checkpoint_500000 | Link | Link |
OpenMoE-8B/32E | An 8B MoE with FLOPs comparable to a 2B LLaMA | 8B | gs://openmoe/openmoe-8b/checkpoint_100000 | Link | Link |
We release all these checkpoints on Huggingface and Google Cloud Storage. For instance, you can download OpenMoE-8B with:
gsutil cp -r gs://openmoe/openmoe-8b/checkpoint_100000 $YOUR_DIR
The base models are trained on 128B tokens. The OpenMoE-8B checkpoint, with 4 MoE layers and 32 experts, has been trained on 200B tokens, and training is still ongoing. If you are interested in the latest checkpoint, please feel free to email Fuzhao (f.xue@u.nus.edu). In addition, we are highly interested in training this model until saturation by performing multi-epoch training, which means we may train it on over 2T tokens (depending on the resources we can get in the coming months).
Note: downloading data from Google Cloud Storage is not free, but you can sign up for Google Cloud and get some free credits.
Get a TPU VM and run the following commands on all TPUs. Researchers can apply to the TPU Research Cloud program for TPU resources.
We are working on the PyTorch + GPU implementation with Colossal AI.
git clone https://github.com/XueFuzhao/OpenMoE.git
bash OpenMoE/script/run_pretrain.sh
Get a TPU VM and run the following commands on all TPUs.
git clone https://github.com/XueFuzhao/OpenMoE.git
bash OpenMoE/script/run_eval.sh
50% The RedPajama + 50% The Stack Dedup. We use a high ratio of coding data to improve reasoning ability.
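As a rough illustration of this 50/50 mixture, the sketch below samples a source dataset for each training sequence according to the mixture weights (hypothetical helper names; the real pipeline lives in the T5x configs, not in this snippet):

```python
import random

# Hypothetical sketch of the 50/50 data mixture, NOT the actual OpenMoE pipeline.
# Each training sequence draws its source dataset with the mixture probability.
SOURCES = ["redpajama", "stack_dedup"]
WEIGHTS = [0.5, 0.5]

def pick_source(rng: random.Random) -> str:
    """Sample one source according to the mixture weights."""
    r = rng.random()
    acc = 0.0
    for src, w in zip(SOURCES, WEIGHTS):
        acc += w
        if r < acc:
            return src
    return SOURCES[-1]

rng = random.Random(0)
counts = {s: 0 for s in SOURCES}
for _ in range(10_000):
    counts[pick_source(rng)] += 1
print(counts)  # roughly 5000 / 5000
```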
We use the umT5 tokenizer to support multilingual continual learning in the future; it can be downloaded from Huggingface or Google Cloud.
OpenMoE is based on ST-MoE but uses a decoder-only architecture. The detailed implementation can be found in Fuzhao's T5x and Flaxformer repos.
We use a modified UL2 training objective but with a causal attention mask (we use more prefix LM data and higher mask ratios because they save computation):
- 50% prefix LM
- 10% span len=3, mask ratio=0.15
- 10% span len=8, mask ratio=0.15
- 10% span len=3, mask ratio=0.5
- 10% span len=8, mask ratio=0.5
- 10% span len=64, mask ratio=0.5
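The objective mixture above can be sketched as a sampler that maps a uniform draw to one objective via cumulative weights (illustrative only; the real sampling is configured in the T5x/Flaxformer gin files):

```python
# Hedged sketch of the UL2-style objective mixture listed above.
OBJECTIVES = [
    (0.50, {"name": "prefix_lm"}),
    (0.10, {"name": "span", "mean_len": 3, "mask_ratio": 0.15}),
    (0.10, {"name": "span", "mean_len": 8, "mask_ratio": 0.15}),
    (0.10, {"name": "span", "mean_len": 3, "mask_ratio": 0.5}),
    (0.10, {"name": "span", "mean_len": 8, "mask_ratio": 0.5}),
    (0.10, {"name": "span", "mean_len": 64, "mask_ratio": 0.5}),
]

def sample_objective(u: float) -> dict:
    """Map a uniform draw u in [0, 1) to one objective via cumulative weights."""
    acc = 0.0
    for prob, cfg in OBJECTIVES:
        acc += prob
        if u < acc:
            return cfg
    return OBJECTIVES[-1][1]

assert abs(sum(p for p, _ in OBJECTIVES) - 1.0) < 1e-9  # weights sum to 1
print(sample_objective(0.25))  # → {'name': 'prefix_lm'}
```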
We use RoPE, the SwiGLU activation, and a 2K context length. We will release a more detailed report soon.
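For reference, a minimal sketch of a SwiGLU feed-forward block is shown below (toy dimensions and weights chosen for illustration; the actual OpenMoE implementation is in the Flaxformer repo):

```python
import math

# Minimal SwiGLU FFN sketch: down( silu(x @ w_gate) * (x @ w_up) ).
# Toy weights for illustration only, not the actual model parameters.
def silu(x: float) -> float:
    return x / (1.0 + math.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    """x: [d_model], w_gate/w_up: [d_model][d_ff], w_down: [d_ff][d_model]."""
    d_ff = len(w_gate[0])
    gate = [silu(sum(x[i] * w_gate[i][j] for i in range(len(x)))) for j in range(d_ff)]
    up = [sum(x[i] * w_up[i][j] for i in range(len(x))) for j in range(d_ff)]
    hidden = [g * u for g, u in zip(gate, up)]  # elementwise gating
    d_model = len(w_down[0])
    return [sum(hidden[j] * w_down[j][k] for j in range(d_ff)) for k in range(d_model)]

# Tiny example: d_model=2, d_ff=3.
x = [1.0, -1.0]
w_gate = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
w_up = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
w_down = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
y = swiglu_ffn(x, w_gate, w_up, w_down)
print(y)  # a length-2 output vector
```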
We evaluate our model on TriviaQA and BigBench-Lite as a first step. We plot the cost-effectiveness curve in the figure below.
Relative cost is approximated by multiplying activated parameters by training tokens. The size of each dot denotes the number of activated parameters per token; the light-gray dots denote the total parameters of the MoE models.
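As a concrete instance of this approximation (hypothetical numbers: we assume OpenMoE-8B activates roughly 2B parameters per token, per the table's "comparable FLOPs of a 2B LLaMA" description, and compare against a hypothetical dense 8B model trained on the same 200B tokens):

```python
# Relative cost ≈ activated parameters × training tokens (the approximation above).
# The ~2B activated-parameter figure is an assumption for illustration,
# not an official number.
def relative_cost(activated_params: float, training_tokens: float) -> float:
    return activated_params * training_tokens

openmoe_8b = relative_cost(2e9, 200e9)  # assumed ~2B activated params, 200B tokens
dense_8b = relative_cost(8e9, 200e9)    # hypothetical dense 8B model, same tokens
print(dense_8b / openmoe_8b)  # → 4.0
```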
For more detailed results, please see our blog post.
Our code is under Apache 2.0 License.
Since the models are trained on the RedPajama and The Stack Dedup datasets, please check the licenses of these two datasets before using the model.
The following authors currently contribute to this project:
Please cite the repo if you use the model and code in this repo.
@misc{openmoe2023,
author = {Fuzhao Xue and Zian Zheng and Yao Fu and Jinjie Ni and Zangwei Zheng and Wangchunshu Zhou and Yang You},
title = {OpenMoE: Open Mixture-of-Experts Language Models},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/XueFuzhao/OpenMoE}},
}