Main Repository | Full Documentation
Cook up amazing multimodal AI applications effortlessly with MiniCPM-o, bringing vision, speech, and live-streaming capabilities right to your fingertips!
Our comprehensive documentation website presents every recipe in a clear, well-organized manner. All features are displayed at a glance, making it easy for you to quickly find exactly what you need.
We support a wide range of users, from individuals to enterprises and researchers.
- Individuals: Enjoy effortless inference using Ollama and Llama.cpp with minimal setup (see the sketch after this list).
- Enterprises: Achieve high-throughput, scalable performance with vLLM and SGLang.
- Researchers: Leverage advanced frameworks including Transformers, LLaMA-Factory, SWIFT, and Align-anything for flexible model development and cutting-edge experimentation.
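For the individual quick-start path, here is a minimal sketch using the `ollama` Python package. It assumes the Ollama server is running and that a MiniCPM-V build has been pulled; the `minicpm-v` model tag is an assumption, so check the Ollama recipe for the exact name.

```python
# Minimal local-inference sketch with the `ollama` Python package.
# Assumes `ollama serve` is running and a MiniCPM-V build was pulled,
# e.g. `ollama pull minicpm-v` (the exact tag may differ).
import ollama

response = ollama.chat(
    model="minicpm-v",  # assumed tag; substitute whatever you pulled
    messages=[
        {
            "role": "user",
            "content": "What is shown in this photo?",
            "images": ["photo.jpg"],  # local image path
        }
    ],
)
print(response["message"]["content"])
```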
Our ecosystem delivers optimal solutions for a variety of hardware environments and deployment demands.
- Web demo: Launch an interactive multimodal AI web demo with FastAPI (a minimal endpoint sketch follows this list).
- Quantized deployment: Maximize efficiency and minimize resource consumption using GGUF, BNB, and AWQ.
- Edge devices: Bring powerful AI experiences to iPhone and iPad, supporting offline and privacy-sensitive applications.
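As a flavor of the web-demo path, here is a deliberately small FastAPI endpoint that forwards an uploaded image and question to a locally running Ollama model, reusing the assumed `minicpm-v` tag from the sketch above. The official Omni streaming demo is more elaborate; treat this as an illustrative sketch only.

```python
# Illustrative FastAPI image-QA endpoint backed by a local Ollama server.
# Not the cookbook's official demo; the model tag is an assumption.
import ollama
from fastapi import FastAPI, File, Form, UploadFile

app = FastAPI()

@app.post("/chat")
async def chat(question: str = Form(...), image: UploadFile = File(...)):
    image_bytes = await image.read()
    response = ollama.chat(
        model="minicpm-v",  # assumed tag; match whatever you pulled
        messages=[{"role": "user", "content": question, "images": [image_bytes]}],
    )
    return {"answer": response["message"]["content"]}
```

Run it with, for example, `uvicorn app:app` and POST a form with a `question` field and an `image` file.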
Explore real-world examples of MiniCPM-V deployed on edge devices using our curated recipes. These demos highlight the model's high efficiency and robust performance in practical scenarios.
- Run locally on iPhone with the iOS demo.
- Run locally on iPad with the iOS demo, and watch the model walk through the process of drawing a rabbit.
[Demo video: ipad_case.mp4]
Ready-to-run examples
| Recipe | Description |
|---|---|
| **Vision Capabilities** | |
| 🖼️ Single-image QA | Question answering on a single image |
| 🧩 Multi-image QA | Question answering with multiple images |
| 🎬 Video QA | Video-based question answering |
| 📄 Document Parser | Parse and extract content from PDFs and webpages |
| 🔍 Text Recognition | Reliable OCR for photos and screenshots |
| **Audio Capabilities** | |
| 🎤 Speech-to-Text | Multilingual speech recognition |
| 🗣️ Text-to-Speech | Instruction-following speech synthesis |
| 🎭 Voice Cloning | Realistic voice cloning and role-play |
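To give a taste of the vision recipes, here is a minimal single-image QA sketch with Hugging Face Transformers. The checkpoint ID and the `model.chat(...)` call follow the published MiniCPM-V model cards, but treat both as assumptions and prefer the linked recipes for exact, current usage.

```python
# Minimal single-image QA sketch with Hugging Face Transformers.
# Checkpoint name and chat() signature follow the MiniCPM-V model cards;
# see the recipes above for the exact, current usage.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "openbmb/MiniCPM-V-2_6"  # assumed checkpoint; newer ones exist
model = AutoModel.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("photo.jpg").convert("RGB")
msgs = [{"role": "user", "content": [image, "Describe this image."]}]

answer = model.chat(image=None, msgs=msgs, tokenizer=tokenizer)
print(answer)
```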
Customize your model with your own ingredients
Data preparation
Follow the guide to set up your training datasets; a rough sample format is sketched below.
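Formats differ slightly by framework, but a typical visual-instruction sample looks roughly like the following. The field names are assumptions based on common MiniCPM-V finetuning formats; follow the data preparation guide for the authoritative schema.

```python
# Rough sketch of one supervised finetuning sample (field names are
# assumptions; verify against the data preparation guide).
sample = {
    "id": "0",
    "image": "images/receipt_001.jpg",  # path relative to the data root
    "conversations": [
        {"role": "user", "content": "<image>\nWhat is the total amount?"},
        {"role": "assistant", "content": "The total is $42.17."},
    ],
}
```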
Training
We provide training methods serving different needs, as follows:
| Framework | Description |
|---|---|
| Transformers | Most flexible for customization |
| LLaMA-Factory | Modular fine-tuning toolkit |
| SWIFT | Lightweight and fast parameter-efficient tuning |
| Align-anything | Visual instruction alignment for multimodal models |
Deploy your model efficiently
| Method | Description |
|---|---|
| vLLM | High-throughput GPU inference |
| SGLang | High-throughput GPU inference |
| Llama.cpp | Fast CPU inference on PC, iPhone and iPad |
| Ollama | User-friendly setup |
| OpenWebUI | Interactive Web demo with Open WebUI |
| Gradio | Interactive Web demo with Gradio |
| FastAPI | Interactive Omni Streaming demo with FastAPI |
| iOS | Interactive iOS demo with llama.cpp |
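For the serving paths, a common pattern is to expose an OpenAI-compatible endpoint and query it with the standard client. A sketch, assuming a vLLM server started with something like `vllm serve openbmb/MiniCPM-V-2_6 --trust-remote-code` (the model ID and flags are assumptions; see the vLLM recipe for the exact command):

```python
# Query an OpenAI-compatible vLLM (or SGLang) server for image QA.
# Assumes the server was started separately, e.g.:
#   vllm serve openbmb/MiniCPM-V-2_6 --trust-remote-code
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="openbmb/MiniCPM-V-2_6",  # must match the served model name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```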
Compress your model to improve efficiency
| Format | Key Feature |
|---|---|
| GGUF | Simplest and most portable format |
| BNB | Simple and easy-to-use quantization method |
| AWQ | High-performance quantization for efficient inference |
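As one concrete example, on-the-fly 4-bit loading with bitsandbytes (BNB) through Transformers looks roughly like this; the model ID and quantization settings are illustrative assumptions, and GGUF and AWQ follow their own toolchains covered in the linked docs.

```python
# Sketch: load a MiniCPM-V checkpoint in 4-bit with bitsandbytes (BNB).
# Model ID and settings are assumptions; see the BNB doc above.
import torch
from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModel.from_pretrained(
    "openbmb/MiniCPM-V-2_6",  # assumed checkpoint
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "openbmb/MiniCPM-V-2_6", trust_remote_code=True
)
```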
Framework support matrix
| Category | Framework | Cookbook Link | Upstream PR | Supported since (branch) | Supported since (release) |
|---|---|---|---|---|---|
| Edge (On-device) | Llama.cpp | Llama.cpp Doc | #15575 (2025-08-26) | master (2025-08-26) | b6282 |
| Edge (On-device) | Ollama | Ollama Doc | #12078 (2025-08-26) | Merging | Waiting for official release |
| Serving (Cloud) | vLLM | vLLM Doc | #23586 (2025-08-26) | main (2025-08-27) | Waiting for official release |
| Serving (Cloud) | SGLang | SGLang Doc | #9610 (2025-08-26) | Merging | Waiting for official release |
| Finetuning | LLaMA-Factory | LLaMA-Factory Doc | #9022 (2025-08-26) | main (2025-08-26) | Waiting for official release |
| Quantization | GGUF | GGUF Doc | N/A | N/A | N/A |
| Quantization | BNB | BNB Doc | N/A | N/A | N/A |
| Quantization | AWQ | AWQ Doc | N/A | N/A | N/A |
| Demos | Gradio Demo | Gradio Demo Doc | N/A | N/A | N/A |
If you'd like us to prioritize support for another open-source framework, please let us know via this short form.
- text-extract-api: Document extraction API using OCR and Ollama-supported models
- comfyui_LLM_party: Build LLM workflows and integrate them into existing image workflows
- Ollama-OCR: OCR package that uses VLMs through Ollama to extract text from images and PDFs
- comfyui-mixlab-nodes: ComfyUI node suite supporting Workflow-to-APP, GPT & 3D, and more
- OpenAvatarChat: Interactive digital-human conversation implementation on a single PC
- pensieve: A privacy-focused passive recording project that captures screen content
- paperless-gpt: Use LLMs to enhance paperless-ngx with AI-powered titles, tags, and OCR
- Neuro: A recreation of Neuro-Sama, running on local models on consumer hardware
We love new recipes! Please share your creative dishes:
- Fork the repository
- Create your recipe
- Submit a pull request
- Found a bug? Open an issue
- Need help? Join our Discord
This cookbook is developed by OpenBMB and OpenSQZ.
This cookbook is served under the Apache-2.0 License - cook freely, share generously! 🍳
If you find our model/code/paper helpful, please consider citing our papers 📝 and starring us ⭐️!
@article{yao2024minicpm,
title={MiniCPM-V: A GPT-4V Level MLLM on Your Phone},
author={Yao, Yuan and Yu, Tianyu and Zhang, Ao and Wang, Chongyi and Cui, Junbo and Zhu, Hongji and Cai, Tianchi and Li, Haoyu and Zhao, Weilin and He, Zhihui and others},
journal={arXiv preprint arXiv:2408.01800},
year={2024}
}
