mergoo



mergoo is a library for easily merging multiple LLM experts and efficiently training the merged LLM. With mergoo, you can integrate the knowledge of different generic or domain-specific LLM experts.

🚀 Features

  • Supports several merging methods: Mixture-of-Experts, Mixture-of-Adapters, and Layer-wise merging
  • Flexible merging for each layer
  • Base models supported: Llama (including LLaMa3), Mistral, Phi3, and BERT
  • Trainers supported: 🤗 Trainer, SFTTrainer, PEFT
  • Devices supported: CPU, MPS, GPU
  • Training choices: train only the routers of the MoE layers, or fully fine-tune the merged LLM

If you like the project, consider leaving a ⭐️

Installation

Install with pip:

pip install mergoo

Install the latest (unstable) version from GitHub:

pip install git+https://github.com/Leeroo-AI/mergoo

Install from source:

git clone https://github.com/Leeroo-AI/mergoo
cd mergoo
pip install -e .
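
To quickly verify the installation, check that the package imports cleanly (a minimal sanity check; it assumes nothing beyond the package name mergoo):

python -c "import mergoo"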

Quick Start

Configuration Setup

Specify the config for merging:

  • model_type: type of the base model. Choices: mistral, llama, or bert.
  • num_experts_per_tok: number of experts selected per token in the MoE layers.
  • experts: configs of the experts to merge; each entry includes an expert_name and a Hugging Face 🤗 model_id.
  • router_layers: the layers on which Mixture-of-Experts routing is applied.

Fully Fine-tuned Experts

This is a sample config for merging fully fine-tuned LLM experts.

config = {
    "model_type": "mistral",
    "num_experts_per_tok": 2,
    "experts": [
        {"expert_name": "base_expert", "model_id": "mistralai/Mistral-7B-v0.1"},
        {"expert_name": "expert_1", "model_id": "meta-math/MetaMath-Mistral-7B"},
        {"expert_name": "expert_2", "model_id": "ajibawa-2023/Code-Mistral-7B"}
    ],
    "router_layers": ["gate_proj", "up_proj", "down_proj"]
}

In the above example, we merge math and code Mistral-based experts. Please refer to this notebook for further details!

Mixture of Adapters (MoE on LoRA)

This is a sample config for merging LoRA fine-tuned LLM experts. mergoo builds a routing layer on top of the LoRAs, resulting in a mixture of adapters.

config = {
    "model_type": "mistral",
    "num_experts_per_tok": 2,
    "base_model": "mistralai/Mistral-7B-v0.1",
    "experts": [
        {"expert_name": "adapter_1", "model_id": "predibase/customer_support"},
        {"expert_name": "adapter_2", "model_id": "predibase/customer_support_accounts"},
        {"expert_name": "adapter_3", "model_id": "predibase/customer_support_orders"},
        {"expert_name": "adapter_4", "model_id": "predibase/customer_support_payments"}
    ],
}

Note that each expert_name starts with adapter instead of expert, and the shared base_model is specified at the top level of the config. Please refer to this notebook for further details!

Merge Experts

With the config in place, mergoo creates the merged LLM as follows:

import torch
from mergoo.compose_experts import ComposeExperts

# compose the experts and save a merged checkpoint
model_id = "data/mistral_lora_moe"
expertmerger = ComposeExperts(config, torch_dtype=torch.float16)
expertmerger.compose()
expertmerger.save_checkpoint(model_id)

Load / Finetune Merged Expert

Now, you can easily train the merged LLM with Hugging Face Trainer:

from transformers import Trainer
from mergoo.models.modeling_mistral import MistralForCausalLM

model = MistralForCausalLM.from_pretrained("data/mistral_lora_moe") 
# NOTE: the 'gate' / router layers are untrained, so a weight-loading warning will appear for them

trainer = Trainer( ... )
trainer.train()
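
As listed under Features, you can either fully fine-tune the merged LLM or train only the routers of the MoE layers. Below is a minimal sketch of router-only training, assuming (as in the note above) that the router parameters contain "gate" in their names; it freezes every other weight before constructing the Trainer:

# Train only the router (gate) layers: freeze all other parameters.
# Assumes router parameters contain "gate" in their names (see the NOTE above).
for name, param in model.named_parameters():
    if "gate" not in name:
        param.requires_grad_(False)

Per the Features list, TRL's SFTTrainer can be used in place of the 🤗 Trainer in the same way.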

📚 Learn More:

After finishing the Quick Start guide, you can explore the tutorials below to further familiarize yourself with mergoo.

  • MoE with fully fine-tuned LLM experts: build a unified Mixture-of-Experts model with fully fine-tuned experts. Inspired by BTX Research (Meta AI).
  • MoE with LoRA fine-tuned experts: build a Mixture-of-Adapters expert. Inspired by xlora | Mixture-of-LoRAs | MoLE | PHATGOOSE | MoELoRA.
  • Hugging Face Blog: a deep dive into the research details behind the merging methods of the mergoo library.
  • LLaMa3-based Experts: build your own MoE-style LLM experts by integrating LLaMa3-based domain experts.
  • Phi3-based Experts: create an MoE-style LLM architecture by merging Phi3-based fine-tuned models.

Mergoo Roadmap and Contributing

mergoo is an open-source library in a fast-evolving domain, and we welcome contributions, whether introducing new features, enhancing infrastructure, or improving documentation.

Here is the mergoo roadmap:

  • Support MoE for Transformer Block
  • Compatibility with Huggingface 🤗
  • Support Trainer, SFTTrainer
  • Loading Unified Checkpoint in BTX
  • Feature: Convertible QKV linear layers
  • Feature: Convertible FF linear layers
  • Feature: Routers only for a list of decoder layer indexes
  • Sharded Safetensor Saving
  • Support experts based on LLaMa and Mistral
  • Support experts based on Phi3
  • Support Mixture of LoRA Experts (Mixture of Adapters)
  • Router load-balancing loss
  • Lazy loading of tensors for low memory usage in Merging
  • Support other Layer-wise merging methods, including Mergekit
  • Support experts based on Gemma and Mamba
  • Support flash-attention
  • Support Mixture of Depths Transformer

Feel free to suggest new features and/or contribute to the mergoo roadmap!

Join our community!

🚀 We would love to hear your feedback, so please join the Leeroo community:

Have a question not listed here? Open a GitHub Issue or send us an email!