Main Repository | Full Documentation
Cook up amazing multimodal AI applications effortlessly with MiniCPM-o, bringing vision, speech, and live-streaming capabilities right to your fingertips!
Our comprehensive documentation website presents every recipe in a clear, well-organized manner. All features are displayed at a glance, making it easy for you to quickly find exactly what you need.
We support a wide range of users, from individuals to enterprises and researchers.
- Individuals: Enjoy effortless inference using Ollama and Llama.cpp with minimal setup (see the sketch after this list).
- Enterprises: Achieve high-throughput, scalable performance with vLLM and SGLang.
- Researchers: Leverage advanced frameworks including Transformers, LLaMA-Factory, SWIFT, and Align-anything for flexible model development and cutting-edge experimentation.
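For the individual quick-start path, here is a minimal sketch using the `ollama` Python package. It assumes the Ollama server is running and that a MiniCPM-V build has been pulled; the `minicpm-v` model tag is an assumption, so check the Ollama recipe for the exact name.

```python
# Minimal local-inference sketch with the `ollama` Python package.
# Assumes `ollama serve` is running and a MiniCPM-V build was pulled,
# e.g. `ollama pull minicpm-v` (the exact tag may differ).
import ollama

response = ollama.chat(
    model="minicpm-v",  # assumed tag; substitute whatever you pulled
    messages=[
        {
            "role": "user",
            "content": "What is shown in this photo?",
            "images": ["photo.jpg"],  # local image path
        }
    ],
)
print(response["message"]["content"])
```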
Our ecosystem delivers optimal solutions for a variety of hardware environments and deployment demands.
- Web demo: Launch an interactive multimodal AI web demo with FastAPI (a minimal endpoint sketch follows this list).
- Quantized deployment: Maximize efficiency and minimize resource consumption using GGUF, BNB, and AWQ.
- Edge devices: Bring powerful AI experiences to iPhone and iPad, supporting offline and privacy-sensitive applications.
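As a flavor of the web-demo path, here is a deliberately small FastAPI endpoint that forwards an uploaded image and question to a locally running Ollama model, reusing the assumed `minicpm-v` tag from the sketch above. The official Omni streaming demo is more elaborate; treat this as an illustrative sketch only.

```python
# Illustrative FastAPI image-QA endpoint backed by a local Ollama server.
# Not the cookbook's official demo; the model tag is an assumption.
import ollama
from fastapi import FastAPI, File, Form, UploadFile

app = FastAPI()

@app.post("/chat")
async def chat(question: str = Form(...), image: UploadFile = File(...)):
    image_bytes = await image.read()
    response = ollama.chat(
        model="minicpm-v",  # assumed tag; match whatever you pulled
        messages=[{"role": "user", "content": question, "images": [image_bytes]}],
    )
    return {"answer": response["message"]["content"]}
```

Run it with, for example, `uvicorn app:app` and POST a form with a `question` field and an `image` file.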
Explore real-world examples of MiniCPM-V deployed on edge devices using our curated recipes. These demos highlight the model's high efficiency and robust performance in practical scenarios.
- Run locally on iPhone with the iOS demo.
- Run locally on iPad with the iOS demo, and watch the model walk through the process of drawing a rabbit.
[Demo video: ipad_case.mp4]
Ready-to-run examples
| Recipe | Description |
|---|---|
| **Vision Capabilities** | |
| 🖼️ Single-image QA | Question answering on a single image |
| 🧩 Multi-image QA | Question answering with multiple images |
| 🎬 Video QA | Video-based question answering |
| 📄 Document Parser | Parse and extract content from PDFs and webpages |
| 🔍 Text Recognition | Reliable OCR for photos and screenshots |
| **Audio Capabilities** | |
| 🎤 Speech-to-Text | Multilingual speech recognition |
| 🗣️ Text-to-Speech | Instruction-following speech synthesis |
| 🎭 Voice Cloning | Realistic voice cloning and role-play |
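To give a taste of the vision recipes, here is a minimal single-image QA sketch with Hugging Face Transformers. The checkpoint ID and the `model.chat(...)` call follow the published MiniCPM-V model cards, but treat both as assumptions and prefer the linked recipes for exact, current usage.

```python
# Minimal single-image QA sketch with Hugging Face Transformers.
# Checkpoint name and chat() signature follow the MiniCPM-V model cards;
# see the recipes above for the exact, current usage.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-V-2_6"  # assumed checkpoint; newer ones exist
model = AutoModel.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("photo.jpg").convert("RGB")
msgs = [{"role": "user", "content": [image, "Describe this image."]}]

answer = model.chat(image=None, msgs=msgs, tokenizer=tokenizer)
print(answer)
```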
Customize your model with your own ingredients
Data preparation
Follow the guide to set up your training datasets; a rough sample format is sketched below.
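Formats differ slightly by framework, but a typical visual-instruction sample looks roughly like the following. The field names are assumptions based on common MiniCPM-V finetuning formats; follow the data preparation guide for the authoritative schema.

```python
# Rough sketch of one supervised finetuning sample (field names are
# assumptions; verify against the data preparation guide).
sample = {
    "id": "0",
    "image": "images/receipt_001.jpg",  # path relative to the data root
    "conversations": [
        {"role": "user", "content": "<image>\nWhat is the total amount?"},
        {"role": "assistant", "content": "The total is $42.17."},
    ],
}
```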
Training
We provide training methods serving different needs, as follows:
| Framework | Description |
|---|---|
| Transformers | Most flexible for customization |
| LLaMA-Factory | Modular fine-tuning toolkit |
| SWIFT | Lightweight and fast parameter-efficient tuning |
| Align-anything | Visual instruction alignment for multimodal models |
Deploy your model efficiently
| Method | Description |
|---|---|
| vLLM | High-throughput GPU inference |
| SGLang | High-throughput GPU inference |
| Llama.cpp | Fast CPU inference on PC, iPhone and iPad |
| Ollama | User-friendly setup |
| OpenWebUI | Interactive Web demo with Open WebUI |
| Gradio | Interactive Web demo with Gradio |
| FastAPI | Interactive Omni Streaming demo with FastAPI |
| iOS | Interactive iOS demo with llama.cpp |
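For the serving paths, a common pattern is to expose an OpenAI-compatible endpoint and query it with the standard client. A sketch, assuming a vLLM server started with something like `vllm serve openbmb/MiniCPM-V-2_6 --trust-remote-code` (the model ID and flags are assumptions; see the vLLM recipe for the exact command):

```python
# Query an OpenAI-compatible vLLM (or SGLang) server for image QA.
# Assumes the server was started separately, e.g.:
#   vllm serve openbmb/MiniCPM-V-2_6 --trust-remote-code
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="openbmb/MiniCPM-V-2_6",  # must match the served model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```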
Compress your model to improve efficiency
| Format | Key Feature |
|---|---|
| GGUF | Simplest and most portable format |
| BNB | Simple and easy-to-use quantization method |
| AWQ | High-performance quantization for efficient inference |
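As one concrete example, on-the-fly 4-bit loading with bitsandbytes (BNB) through Transformers looks roughly like this; the model ID and quantization settings are illustrative assumptions, and GGUF and AWQ follow their own toolchains covered in the linked docs.

```python
# Sketch: load a MiniCPM-V checkpoint in 4-bit with bitsandbytes (BNB).
# Model ID and settings are assumptions; see the BNB doc above.
import torch
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-V-2_6",  # assumed checkpoint
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "openbmb/MiniCPM-V-2_6", trust_remote_code=True
)
```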
Framework support matrix
| Category | Framework | Cookbook Link | Upstream PR | Supported since (branch) | Supported since (release) |
|---|---|---|---|---|---|
| Edge (On-device) | Llama.cpp | Llama.cpp Doc | #15575 (2025-08-26) | master (2025-08-26) | b6282 |
| Edge (On-device) | Ollama | Ollama Doc | #12078 (2025-08-26) | Merging | Waiting for official release |
| Serving (Cloud) | vLLM | vLLM Doc | #23586 (2025-08-26) | main (2025-08-27) | Waiting for official release |
| Serving (Cloud) | SGLang | SGLang Doc | #9610 (2025-08-26) | Merging | Waiting for official release |
| Finetuning | LLaMA-Factory | LLaMA-Factory Doc | #9022 (2025-08-26) | main (2025-08-26) | Waiting for official release |
| Quantization | GGUF | GGUF Doc | N/A | N/A | N/A |
| Quantization | BNB | BNB Doc | N/A | N/A | N/A |
| Quantization | AWQ | AWQ Doc | N/A | N/A | N/A |
| Demos | Gradio Demo | Gradio Demo Doc | N/A | N/A | N/A |
If you'd like us to prioritize support for another open-source framework, please let us know via this short form.
- text-extract-api: Document extraction API using OCR and Ollama-supported models
- comfyui_LLM_party: Build LLM workflows and integrate them into existing image workflows
- Ollama-OCR: OCR package that uses VLMs through Ollama to extract text from images and PDFs
- comfyui-mixlab-nodes: ComfyUI node suite supporting Workflow-to-APP, GPT & 3D, and more
- OpenAvatarChat: Interactive digital-human conversation implementation on a single PC
- pensieve: A privacy-focused passive recording project that captures screen content
- paperless-gpt: Use LLMs to enhance paperless-ngx with AI-powered titles, tags, and OCR
- Neuro: A recreation of Neuro-Sama, running on local models on consumer hardware
We love new recipes! Please share your creative dishes:
- Fork the repository
- Create your recipe
- Submit a pull request
- Found a bug? Open an issue
- Need help? Join our Discord
This cookbook is developed by OpenBMB and OpenSQZ.
This cookbook is served under the Apache-2.0 License - cook freely, share generously! 🍳
If you find our model/code/paper helpful, please consider citing our papers 📝 and starring us ⭐️!
@article{yao2024minicpm,
title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
journal={arXiv preprint arXiv:2408.01800},
year={2024}
}
