mergoo is a library for easily merging multiple LLM experts and efficiently training the merged LLM. With mergoo, you can integrate the knowledge of different generic or domain-specific LLM experts.
- Supports several merging methods: Mixture-of-Experts, Mixture-of-Adapters, and Layer-wise merging
- Flexible merging for each layer
- Base models supported: Llama (including LLaMa3), Mistral, Phi3, and BERT
- Trainers supported: 🤗 Trainer, SFTrainer, PEFT
- Devices supported: CPU, MPS, GPU
- Training choices: train only the routers of the MoE layers, or fully fine-tune the merged LLM
If you like the project, consider leaving a ⭐️
Install by pip:
pip install mergoo
Install the latest unstable version from GitHub:
pip install git+https://github.com/Leeroo-AI/mergoo
Install it from source:
git clone https://github.com/Leeroo-AI/mergoo
cd mergoo
pip install -e .
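As an optional sanity check (nothing mergoo-specific), confirm that the package imports cleanly:
import mergoo  # a clean import confirms the installation worked
print(mergoo.__file__)  # informational: where the package was installed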
Specify the config for merging:
- model_type: type of the base model. Choices: mistral, llama, or bert.
- num_experts_per_tok: number of experts for each token of the MoE.
- experts: config of the experts to merge; includes expert_name and the Hugging Face 🤗 model_id.
- router_layers: layers chosen for applying Mixture-of-Experts.
This is a sample config when merging fully fine-tuned LLM experts.
config = {
"model_type": "mistral",
"num_experts_per_tok": 2,
"experts": [
{"expert_name": "base_expert", "model_id": "mistralai/Mistral-7B-v0.1"},
{"expert_name": "expert_1", "model_id": "meta-math/MetaMath-Mistral-7B"},
{"expert_name": "expert_2", "model_id": "ajibawa-2023/Code-Mistral-7B"}
],
"router_layers": ["gate_proj", "up_proj", "down_proj"]
}
For the above example, we merged math and code Mistral-based experts. Please refer to this notebook for further details!
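As a quick sketch, this fully fine-tuned config is composed with the same ComposeExperts API used later in this guide; the output directory data/mistral_math_code_moe is just an illustrative placeholder:
import torch
from mergoo.compose_experts import ComposeExperts

# merge the three fully fine-tuned Mistral experts defined in `config`
# into a single MoE checkpoint on disk
expertmerger = ComposeExperts(config, torch_dtype=torch.float16)
expertmerger.compose()
expertmerger.save_checkpoint("data/mistral_math_code_moe")  # placeholder output path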
This is a sample config when merging LoRA fine-tuned LLM experts. mergoo
builds a routing layer on top of LoRAs, resulting in a mixture of adapters.
config = {
"model_type": "mistral",
"num_experts_per_tok": 2,
"base_model": "mistralai/Mistral-7B-v0.1",
"experts": [
{"expert_name": "adapter_1", "model_id": "predibase/customer_support"},
{"expert_name": "adapter_2", "model_id": "predibase/customer_support_accounts"},
{"expert_name": "adapter_3", "model_id": "predibase/customer_support_orders"},
{"expert_name": "adapter_4", "model_id": "predibase/customer_support_payments"}
],
}
The expert_name starts with adapter instead of expert. Please refer to this notebook for further details!
Following the config setup, mergoo
creates the merged LLM as:
import torch
from mergoo.compose_experts import ComposeExperts
# create checkpoint
model_id = "data/mistral_lora_moe"
expertmerger = ComposeExperts(config, torch_dtype=torch.float16)
expertmerger.compose()
expertmerger.save_checkpoint(model_id)
Now, you can easily train the merged LLM with Hugging Face Trainer:
from transformers import Trainer
from mergoo.models.modeling_mistral import MistralForCausalLM
model = MistralForCausalLM.from_pretrained("data/mistral_lora_moe")
# NOTE: the 'gate' / router layers are untrained, so a weight-loading warning will appear for them
trainer = Trainer( ... )
trainer.train()
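To use the "train only the routers" option instead of fully fine-tuning, one approach is to freeze everything except the router weights before building the Trainer. This is a minimal sketch, not part of the mergoo API: it assumes the routers are the 'gate' submodules mentioned in the NOTE above, so their parameter names contain ".gate."; verify the exact names on your checkpoint with model.named_parameters().
from mergoo.models.modeling_mistral import MistralForCausalLM

model = MistralForCausalLM.from_pretrained("data/mistral_lora_moe")

# Freeze all weights except the (untrained) router layers.
# Assumption: router parameters contain ".gate." in their names; inspect
# [n for n, _ in model.named_parameters()] to confirm before relying on this.
for name, param in model.named_parameters():
    param.requires_grad = ".gate." in name

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable (router) parameters: {trainable}")
The frozen model can then be passed to the same Trainer setup shown above.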
After finishing the Quick Start guide, you can explore the tutorials below to further familiarize yourself with mergoo.
| Notebook | Details |
|---|---|
| MoE with fully fine-tuned LLM experts | Build a unified Mixture-of-Experts model with fully fine-tuned experts. Inspired by BTX Research (Meta AI). |
| MoE with LoRA fine-tuned experts | Build a Mixture-of-Adapters expert. Inspired by xlora, Mixture-of-LoRAs, MoLE, PHATGOOSE, and MoELoRA. |
| Hugging Face Blog | Deep dive into the research details behind the merging methods of the mergoo library. |
| LLaMa3-based Experts | Build your own MoE-style LLM experts by integrating LLaMa3-based domain experts. |
| Phi3-based Experts | Create an MoE-style LLM architecture by merging Phi3-based fine-tuned models. |
As an open-source library in a fast-evolving domain, we welcome contributions, whether introducing new features, enhancing infrastructure, or improving documentation.
Here is the mergoo roadmap:
- Support MoE for Transformer Block
- Compatibility with Huggingface 🤗
- Support Trainer, SFTrainer
- Loading Unified Checkpoint in BTX
- Feature: Convertible QKV linear layers
- Feature: Convertible FF linear layers
- Feature: Routers only for a list of decoder layer indexes
- Sharded Safetensor Saving
- Support experts based on LLaMa and Mistral
- Support experts based on Phi3
- Support Mixture of LORA Experts (Mixture of Adapters)
- Router Load balancing loss
- Lazy loading of tensors for low memory usage in Merging
- Support other Layer-wise merging methods, including Mergekit
- Support experts based on Gemma and Mamba
- Support flash-attention
- Support Mixture of Depths Transformer
Feel free to suggest new features and/or contribute to the mergoo roadmap!
🚀 We'd love to hear your feedback; please join the Leeroo community:
Have a question not listed here? Open a GitHub Issue or send us an email!