A Tool to Gather a Selection of Large Multimodal Models for Image Captioning

Overview

Large Multimodal Models (LMMs) are used for many purposes, and image captioning is one of them. To compare the captioning quality and runtime of several well-known models, I wrote a notebook with a Gradio interface. If you have domain-specific images and want to see how they will be captioned, give it a try!
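
Concretely, the notebook boils down to a captioning function per model, wrapped in a Gradio interface. Below is a minimal sketch of that shape; the caption function is a placeholder, not the repository's actual model dispatch:

import gradio as gr
from PIL import Image

def caption(image: Image.Image) -> str:
    # Placeholder: the notebook dispatches to the selected LMM here,
    # e.g. MoonDream or DeepSeek-VL.
    return "a caption produced by the selected model"

demo = gr.Interface(
    fn=caption,
    inputs=gr.Image(type="pil"),
    outputs="text",
    title="Caption with LMMs",
)

if __name__ == "__main__":
    demo.launch()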

📌 Important note

The models used here belong to their respective authors; please refer to their original works (see MoonDream and DeepSeek-VL under Acknowledgement below).

Installation

Creating a dedicated environment is optional but recommended:

conda create -n lmm python=3.9
conda activate lmm

Clone the repository, then install the dependencies:

git clone https://github.com/IceTea42/caption-with-lmms
cd caption-with-lmms
pip install -r requirements.txt

Alongside this repository, install DeepSeek-VL from source:

git clone https://github.com/deepseek-ai/DeepSeek-VL
cd DeepSeek-VL
pip install -e .
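
To confirm the editable install is importable, a quick check can be run from Python (assuming the package installs under the module name deepseek_vl, as the upstream repository suggests):

# Sanity check; the module name deepseek_vl is an assumption based on the upstream repo.
import deepseek_vl
print("DeepSeek-VL installed at:", deepseek_vl.__file__)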

Acknowledgement

Given its comparative nature, this repository builds upon MoonDream and DeepSeek-VL.