Large Multimodal Models (LMMs) are used for many tasks, and image captioning is one of them. To compare the captioning quality and runtime of several well-known models, I wrote a notebook with a Gradio interface. If you have domain-specific images and want to see how they get captioned, give it a try!
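The core idea of the notebook is to run the same image through each model and record both the caption and the runtime. A minimal sketch of that comparison loop is below; the stub functions are placeholders standing in for the real MoonDream and DeepSeek-VL calls, and all names here are illustrative, not the notebook's actual code.

```python
import time

# Placeholder captioners; in the notebook, each entry would invoke a real
# model's generate step (e.g. MoonDream or DeepSeek-VL).
def caption_stub_a(image_path):
    return "a photo of a cat"

def caption_stub_b(image_path):
    return "a cat sitting on a windowsill"

MODELS = {
    "model-a": caption_stub_a,
    "model-b": caption_stub_b,
}

def compare_captions(image_path):
    """Run every registered captioner on one image, recording caption and runtime."""
    results = {}
    for name, fn in MODELS.items():
        start = time.perf_counter()
        caption = fn(image_path)
        elapsed = time.perf_counter() - start
        results[name] = {"caption": caption, "seconds": elapsed}
    return results

if __name__ == "__main__":
    for name, r in compare_captions("example.jpg").items():
        print(f"{name}: {r['caption']} ({r['seconds']:.3f}s)")
```

Keeping the models behind a simple name-to-function registry makes it easy to add another captioner to the comparison without touching the loop.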
The models used here belong to their respective authors; please refer to their original works:
Consider creating a dedicated environment (optional but recommended):

```shell
conda create -n lmm python=3.9
conda activate lmm
```
Clone the repository, then install its dependencies:

```shell
git clone https://github.com/IceTea42/caption-with-llms
cd caption-with-llms
pip install -r requirements.txt
```
Next to this repository, install DeepSeek-VL from source:

```shell
git clone https://github.com/deepseek-ai/DeepSeek-VL
cd DeepSeek-VL
pip install -e .
```
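Once installed, DeepSeek-VL can be driven roughly as shown in its own README; the sketch below condenses that flow into a single function. The checkpoint name, image path, and prompt are placeholders, and the exact API may differ across DeepSeek-VL versions, so treat this as an outline rather than a verified implementation.

```python
def caption_with_deepseek_vl(image_path, prompt="Describe this image."):
    """Sketch of DeepSeek-VL captioning, condensed from the upstream README.
    Imports are deferred so the file can be read without the package installed."""
    import torch
    from transformers import AutoModelForCausalLM
    from deepseek_vl.models import VLChatProcessor
    from deepseek_vl.utils.io import load_pil_images

    model_path = "deepseek-ai/deepseek-vl-1.3b-chat"  # assumed checkpoint name
    processor = VLChatProcessor.from_pretrained(model_path)
    tokenizer = processor.tokenizer
    model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
    model = model.to(torch.bfloat16).eval()

    # DeepSeek-VL expects a chat-style conversation with an image placeholder.
    conversation = [
        {"role": "User",
         "content": f"<image_placeholder>{prompt}",
         "images": [image_path]},
        {"role": "Assistant", "content": ""},
    ]
    pil_images = load_pil_images(conversation)
    inputs = processor(conversations=conversation, images=pil_images,
                       force_batchify=True).to(model.device)
    inputs_embeds = model.prepare_inputs_embeds(**inputs)

    outputs = model.language_model.generate(
        inputs_embeds=inputs_embeds,
        attention_mask=inputs.attention_mask,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        max_new_tokens=128,
        do_sample=False,
    )
    return tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
```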
Given its comparative nature, this repository builds upon MoonDream and DeepSeek-VL.