This repo provides a simple GUI application for using the Qwen2-VL-7B-Captioner-Relaxed model for image captioning. It lets you:
- Upload an image and generate a caption.
- Select from predefined system prompts or enter a custom prompt.
- Automatically load and initialize the model on the first caption generation.
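The lazy model loading described in the last feature can be sketched as follows. This is a minimal illustration of the pattern, not the repo's actual code; the `load_model` body is a placeholder and the caching function name is an assumption:

```python
# Sketch of lazy model initialization: the expensive model load runs only on
# the first caption request; later requests reuse the cached instance.
_model = None

def load_model():
    # Placeholder for the real load, e.g. loading the
    # Qwen2-VL-7B-Captioner-Relaxed weights via transformers (assumed detail).
    return object()

def get_model():
    """Return the cached model, loading it on first use."""
    global _model
    if _model is None:
        _model = load_model()
    return _model
```

The first `get_model()` call pays the full load cost; subsequent calls return the cached object immediately.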
- Python 3.9 or higher
- pip (Python package installer)
- GPU with CUDA support and at least 16 GB of VRAM
- 16 GB of storage for the model weights
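Because the weights occupy roughly 16 GB on disk, a quick pre-download free-space check can look like this. This is a hedged sketch using only the standard library; the function name and where it would be called are illustrative, not part of the repo:

```python
import shutil

def has_free_space(path=".", required_gb=16):
    """Return True if the filesystem holding `path` has at least
    `required_gb` GiB of free space."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3

if not has_free_space():
    print("Warning: less than 16 GiB free; the model download may fail.")
```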
- Clone the repository:

  ```
  git clone https://github.com/ertugrul-dmr/qwen2vl-captioner-gui.git
  cd qwen2vl-captioner-gui
  ```
- Install the required packages:

  Linux/macOS:

  ```
  python3 -m venv venv
  source venv/bin/activate
  pip3 install torch
  pip3 install -r requirements.txt
  ```

  Windows:

  ```
  python -m venv venv
  .\venv\Scripts\activate
  pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
  pip install -r requirements.txt
  ```
- Run the `app.py` script:

  ```
  python app.py
  ```
- Open your web browser and go to `http://localhost:7860` to access the Gradio interface.
- Upload an image by clicking on the image input box.
- Select a system prompt from the dropdown menu, enter a custom prompt, or leave the default.
- Click the "Generate Caption" button to generate a caption for the image.
- The generated caption will be displayed in the "Generated Caption" textbox.
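The prompt handling in the second step above can be illustrated like this. It is a simplified sketch of the choose-preset-or-custom logic; the preset texts, dictionary keys, and function name are assumptions, not taken from the repo:

```python
# Illustrative mapping of dropdown choices to system prompts (texts assumed).
SYSTEM_PROMPTS = {
    "Default": "Describe this image in detail.",
    "Short caption": "Write a one-sentence caption for this image.",
}

def resolve_prompt(selected, custom=""):
    """Prefer a non-empty custom prompt; otherwise use the selected preset,
    falling back to the default preset for unknown selections."""
    if custom.strip():
        return custom.strip()
    return SYSTEM_PROMPTS.get(selected, SYSTEM_PROMPTS["Default"])
```

With this shape, a blank custom-prompt textbox leaves the dropdown selection in effect, which matches the "or leave as default" behavior described above.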
This project is licensed under the MIT License. See the LICENSE file for details.