This repo provides a simple GUI application for using the Qwen2-VL-7B-Captioner-Relaxed model for image captioning. It lets you:
- Upload an image and generate a caption.
- Select from predefined system prompts or enter a custom prompt.
- Automatically load and initialize the model on the first caption generation.
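The lazy model loading described in the last feature can be sketched as follows. This is a minimal illustration of the pattern, not the repo's actual code; the `load_model` body is a placeholder and the caching function name is an assumption:

```python
# Sketch of lazy model initialization: the expensive model load runs only on
# the first caption request; later requests reuse the cached instance.
_model = None

def load_model():
    # Placeholder for the real load, e.g. loading the
    # Qwen2-VL-7B-Captioner-Relaxed weights via transformers (assumed detail).
    return object()

def get_model():
    """Return the cached model, loading it on first use."""
    global _model
    if _model is None:
        _model = load_model()
    return _model
```

The first `get_model()` call pays the full load cost; subsequent calls return the cached object immediately.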
- Python 3.9 or higher
- pip (Python package installer)
- GPU with CUDA support and at least 16 GB of VRAM
- 16 GB of storage for the model weights
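Because the weights occupy roughly 16 GB on disk, a quick pre-download free-space check can look like this. This is a hedged sketch using only the standard library; the function name and where it would be called are illustrative, not part of the repo:

```python
import shutil

def has_free_space(path=".", required_gb=16):
    """Return True if the filesystem holding `path` has at least
    `required_gb` GiB of free space."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3

if not has_free_space():
    print("Warning: less than 16 GiB free; the model download may fail.")
```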
- Clone the repository:

  ```
  git clone https://github.com/ertugrul-dmr/qwen2vl-captioner-gui.git
  cd qwen2vl-captioner-gui
  ```
- Install the required packages:

  Linux/macOS:

  ```
  python3 -m venv venv
  source venv/bin/activate
  pip3 install torch
  pip3 install -r requirements.txt
  ```

  Windows:

  ```
  python -m venv venv
  .\venv\Scripts\activate
  pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
  pip install -r requirements.txt
  ```
- Run the `app.py` script:

  ```
  python app.py
  ```
- Open your web browser and go to `http://localhost:7860` to access the Gradio interface.
- Upload an image by clicking on the image input box.
- Select a system prompt from the dropdown menu, enter a custom prompt, or leave the default.
- Click the "Generate Caption" button to generate a caption for the image.
- The generated caption will be displayed in the "Generated Caption" textbox.
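The prompt handling in the second step above can be illustrated like this. It is a simplified sketch of the choose-preset-or-custom logic; the preset texts, dictionary keys, and function name are assumptions, not taken from the repo:

```python
# Illustrative mapping of dropdown choices to system prompts (texts assumed).
SYSTEM_PROMPTS = {
    "Default": "Describe this image in detail.",
    "Short caption": "Write a one-sentence caption for this image.",
}

def resolve_prompt(selected, custom=""):
    """Prefer a non-empty custom prompt; otherwise use the selected preset,
    falling back to the default preset for unknown selections."""
    if custom.strip():
        return custom.strip()
    return SYSTEM_PROMPTS.get(selected, SYSTEM_PROMPTS["Default"])
```

With this shape, a blank custom-prompt textbox leaves the dropdown selection in effect, which matches the "or leave as default" behavior described above.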
This project is licensed under the MIT License. See the LICENSE file for details.