XFT: Unlocking the Power of Code Instruction Tuning by Simply Merging Upcycled Mixture-of-Experts

Primary language: Python. License: Apache-2.0.

Important

We are still cleaning up the code, improving the documentation, and adding more implementation details. Please stay tuned!

XFT is built on the implementation of Magicoder (https://github.com/ise-uiuc/magicoder). To set up the environment for experiments on DeepSeek-Coder-1.3B, run the following commands:

conda env create -f xft_env.yml
conda activate xft
pip install flash-attn==2.1.0 --no-build-isolation

To obtain XFT_DS, run the following steps in order:

Step 1: Upcycle an MoE model from DeepSeek-Coder-1.3B Base.

export PYTHONPATH=:[YOUR_HOME_PATH]/xft/src:[YOUR_HOME_PATH]/xft/src/magicoder
cd [YOUR_HOME_PATH]/xft/src/magicoder
python convert_dense_to_moe.py \
 --model deepseek-ai/deepseek-coder-1.3b-base \
 --save_path "deepseek-coder-8x1.3b-top-6-moe-base"
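The idea behind upcycling can be illustrated with a small, self-contained sketch. This is not the code in convert_dense_to_moe.py: the function name `upcycle_ffn`, the `.mlp.` marker, and the `.mlp.experts.<i>.` key layout are all hypothetical, and plain nested lists stand in for real tensors. The sketch assumes each expert starts as a copy of the dense FFN; a real MoE also needs a router, which is omitted here.

```python
import copy


def upcycle_ffn(dense_state, num_experts=8, ffn_marker=".mlp."):
    """Copy every dense FFN entry into `num_experts` expert slots.

    Entries that are not part of an FFN (attention, embeddings, norms)
    are carried over unchanged. The ".mlp.experts.<i>." key layout is a
    hypothetical naming scheme chosen for this illustration.
    """
    moe_state = {}
    for name, tensor in dense_state.items():
        if ffn_marker in name:
            prefix, suffix = name.split(ffn_marker, 1)
            for i in range(num_experts):
                # Each expert gets its own independent copy of the weights.
                key = f"{prefix}.mlp.experts.{i}.{suffix}"
                moe_state[key] = copy.deepcopy(tensor)
        else:
            moe_state[name] = tensor
    return moe_state


# Toy "state dict" with one attention matrix and one FFN matrix.
dense = {
    "layers.0.self_attn.q_proj.weight": [[1.0, 0.0], [0.0, 1.0]],
    "layers.0.mlp.up_proj.weight": [[0.5, -0.5], [0.25, 0.75]],
}
moe = upcycle_ffn(dense, num_experts=8)
```

Copying rather than sharing the weights matters in this sketch: each expert must be free to diverge from the others during instruction tuning.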

Step 2: Download the Evol-Instruct dataset and put it under the xft/data folder.

Instruction-tune the upcycled MoE model on the Evol-Instruct dataset.

bash train_moe.sh

Evaluate the instruction-tuned MoE model on HumanEval(+).

bash test_moe.sh

Step 3: Extract FFN weights from the instruction-tuned MoE model.

python convert_moe_to_ffn.py \
 --model "ds-8x1.3b-top-6-universal-evol-instruct-5e-5_bs_64_epoch_4" \
 --save_path "ds-8x1.3b-top-6-universal-evol-instruct-5e-5_bs_64_epoch_4_ffn"

Step 4: Before learning the mixing coefficients, set shared_expert_weight ($\lambda$) and ffn_folder_path (the path to the folder of FFN weights) in the config file of the instruction-tuned MoE model (ds-8x1.3b-top-6-universal-evol-instruct-5e-5_bs_64_epoch_4/config.json).
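If you prefer to script this edit rather than change config.json by hand, something like the following should work. It is only a sketch: the helper name `set_merge_options` is hypothetical, the value 0.75 is taken from the merge command in Step 7, and whether ffn_folder_path should be relative or absolute is an assumption to check against your setup. The demo runs on a throwaway config in a temporary directory so that no real file is touched.

```python
import json
import tempfile
from pathlib import Path


def set_merge_options(config_path, shared_expert_weight, ffn_folder_path):
    """Add or overwrite the two merge-related keys in a config.json file."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    config["shared_expert_weight"] = shared_expert_weight
    config["ffn_folder_path"] = ffn_folder_path
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config


# Demo on a throwaway config file; all other keys are left untouched.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "config.json"
    demo_path.write_text(json.dumps({"model_type": "demo"}))
    updated = set_merge_options(
        demo_path,
        shared_expert_weight=0.75,
        ffn_folder_path="ds-8x1.3b-top-6-universal-evol-instruct-5e-5_bs_64_epoch_4_ffn",
    )
    reloaded = json.loads(demo_path.read_text())
```

For the real model, point `config_path` at ds-8x1.3b-top-6-universal-evol-instruct-5e-5_bs_64_epoch_4/config.json.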

Step 5: Initialize the mixing coefficients, which are used to merge the experts in the instruction-tuned MoE model.

python convert_moe_to_weighted.py \
 --model "ds-8x1.3b-top-6-universal-evol-instruct-5e-5_bs_64_epoch_4" \
 --save_path "ds-8x1.3b-top-6-universal-evol-instruct-5e-5_bs_64_epoch_4_weighted_dense" \
 --num_experts 8

Step 6: Learn the mixing coefficients on the Evol-Instruct dataset.

bash train_weighted.sh

Step 7: Merge the instruction-tuned MoE model using the learned mixing coefficients. The result is an instruction-tuned model with the same architecture as DeepSeek-Coder-1.3B Base.

python convert_weighted_to_dense.py \
 --model_moe "ds-8x1.3b-top-6-universal-evol-instruct-5e-5_bs_64_epoch_4" \
 --model_dense "ds-8x1.3b-top-6-universal-evol-instruct-5e-5_bs_64_epoch_4_weighted_dense-lambda-75-1e-5_bs_64_epoch_1" \
 --save_path "ds-8x1.3b-top-6-universal-evol-instruct-5e-5_bs_64_epoch_4_weighted_dense-lambda-75-1e-5_bs_64_epoch_1-dense" \
 --num_experts 8 \
 --shared_expert_weight 0.75
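To give a feel for what a merge with a shared expert weight could compute, here is a toy sketch. It is one plausible reading, not the logic of convert_weighted_to_dense.py. It assumes that the shared expert sits at index 0 and keeps a fixed weight $\lambda$, and that the remaining experts split $1 - \lambda$ according to softmax-normalized mixing coefficients. The names `merge_experts` and `mixing_logits` are hypothetical, and nested lists stand in for weight matrices.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def merge_experts(expert_weights, mixing_logits, shared_expert_weight=0.75):
    """Collapse several expert matrices into one dense matrix.

    Assumptions (not confirmed by the repository): expert 0 is the shared
    expert with fixed weight lambda, and the other experts share
    (1 - lambda) in proportion to softmax(mixing_logits).
    """
    others = expert_weights[1:]
    if len(others) != len(mixing_logits):
        raise ValueError("need one mixing logit per non-shared expert")
    alphas = softmax(mixing_logits)
    coeffs = [shared_expert_weight] + [
        (1.0 - shared_expert_weight) * a for a in alphas
    ]
    rows, cols = len(expert_weights[0]), len(expert_weights[0][0])
    merged = [
        [
            sum(c * w[r][k] for c, w in zip(coeffs, expert_weights))
            for k in range(cols)
        ]
        for r in range(rows)
    ]
    return merged, coeffs


# Eight 1x1 "experts" holding the values 0..7, with uniform mixing logits.
experts = [[[float(i)]] for i in range(8)]
merged, coeffs = merge_experts(experts, [0.0] * 7, shared_expert_weight=0.75)
```

Under these assumptions the coefficients always sum to 1, so the merged matrix is a convex combination of the expert matrices and has the same shape as a single dense FFN weight.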

Evaluate the final model on HumanEval(+).

bash test.sh