This tool uses the Eagle-X5-7B model from NVIDIA to generate keyword-based captions for images in an input folder. Special thanks to NVIDIA for training this powerful model.
It's a fast and robust captioning model that produces comma-separated keyword outputs.
- Python 3.10 or above.
  - Tested with 3.10, 3.11, and 3.12.
  - Does not work with 3.8.
- CUDA 12.1.
  - It may work with other versions, but this is untested.
To use CUDA / GPU-accelerated captioning, you'll need roughly 6 GB of VRAM or more.
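Before installing anything, you can confirm your interpreter meets the version requirement above. This is a minimal sketch for illustration; the repository itself does not ship such a check:

```python
import sys

def python_version_ok(version_info=sys.version_info):
    """Return True when the interpreter is at least Python 3.10
    (3.10-3.12 are tested; 3.8 is known not to work)."""
    return tuple(version_info[:2]) >= (3, 10)

if __name__ == "__main__":
    print("Python version OK" if python_version_ok() else "Python 3.10+ required")
```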
- Create a virtual environment. Use the included `venv_create.bat` to create it automatically, using Python 3.10 or above.
- Install the libraries from `requirements.txt` with `pip install -r requirements.txt`. The `venv_create.bat` script does this for you if you choose to install requirements when asked.
- Install PyTorch for your version of CUDA. It has only been tested with CUDA 12.1, but may work with other versions.
- Open `batch.py` in a text editor and change the `BATCH_SIZE = 7` value to match your GPU's VRAM. For a 6 GB VRAM GPU, use 1; for a 24 GB VRAM GPU, use 7.
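The two reference points above (1 at 6 GB, 7 at 24 GB) can be interpolated for GPUs in between. The helper below is only an illustrative sketch of that rule of thumb, not part of `batch.py`:

```python
def suggest_batch_size(vram_gb: float) -> int:
    """Suggest a BATCH_SIZE by linearly interpolating between the two
    reference points from the instructions: 1 at 6 GB and 7 at 24 GB."""
    if vram_gb < 6:
        return 0  # below the ~6 GB minimum for GPU captioning
    if vram_gb >= 24:
        return 7
    # roughly one extra image per 3 GB of VRAM between 6 GB and 24 GB
    return 1 + round((vram_gb - 6) * 6 / 18)
```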
- Activate the virtual environment. If you created it with `venv_create.bat`, you can run `venv_activate.bat`.
- Run `python batch.py` from the virtual environment. This runs captioning on all images in the `/input/` folder.
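Conceptually, the batch step walks the input folder and groups images into `BATCH_SIZE`-sized chunks before sending them to the model. The sketch below illustrates that idea only; the folder name and extension list are assumptions, not taken from `batch.py`:

```python
from pathlib import Path

# Assumed set of common image formats; adjust to what your batch.py accepts.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}

def collect_batches(input_dir, batch_size):
    """Return image paths from input_dir grouped into batch_size-sized lists."""
    images = sorted(
        p for p in Path(input_dir).iterdir()
        if p.suffix.lower() in IMAGE_EXTENSIONS
    )
    return [images[i:i + batch_size] for i in range(0, len(images), batch_size)]
```

For example, `collect_batches("input", 7)` with 10 images yields one batch of 7 and one of 3.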
Thanks to MNeMoNiCuZ for the original script upon which this one is based, and Gökay Aydoğan for additional script support.