WD Llava Caption

This repo is for an experiment I have been doing with LLaVA 1.6: putting descriptive tags into the prompt seems to yield much more accurate and detailed results about an image.
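
As a rough sketch of the idea (not the repo's exact code), the prompt sent to LLaVA might be assembled like this; the tag list shown is a hypothetical example of tagger output:

    # Rough sketch of the tag-augmented prompt idea -- not the repo's exact code.
    # In this project the tags come from a tagger model downloaded via Hugging Face.
    wd_tags = ["1girl", "outdoors", "umbrella", "rain"]  # hypothetical tagger output

    prompt = (
        "Describe this image in detail. "
        "A separate tagger detected these elements, which may help: "
        + ", ".join(wd_tags)
    )
    print(prompt)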

Installation

  1. Clone the repository and navigate to the directory:

    git clone https://github.com/ausboss/wd-llava-caption.git
    cd wd-llava-caption
  2. Install the required dependencies:

    pip install -r requirements.txt

Usage

  1. Run the server:

    python server.py
  2. Open the Jupyter notebook:

    jupyter notebook notebook.ipynb
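
The notebook drives the captioning workflow end to end. If you want to script against the server directly instead, a request might look like the sketch below; the /caption route, port, and payload shape are assumptions, so check server.py for the actual API:

    # Hypothetical client call -- the route, port, and payload are assumptions;
    # check server.py for the real API before using this.
    import requests

    with open("image.png", "rb") as f:
        resp = requests.post(
            "http://127.0.0.1:5000/caption",  # assumed host, port, and route
            files={"image": f},
        )
    print(resp.json())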

Environment Variables

Set the environment variables as needed, using sample.env as a template. You will need a Hugging Face token to download the model that generates the tags; you can create one at https://huggingface.co/settings/tokens after logging in.
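
Loading the variables in Python typically looks like the sketch below; the HF_TOKEN name is an assumption, so match it to whatever sample.env actually defines:

    # Minimal sketch of reading the Hugging Face token from the environment.
    # The HF_TOKEN variable name is an assumption -- match it to sample.env.
    import os

    from dotenv import load_dotenv  # pip install python-dotenv

    load_dotenv()  # reads .env from the working directory
    hf_token = os.getenv("HF_TOKEN")
    if hf_token is None:
        raise RuntimeError("HF_TOKEN is not set; copy sample.env to .env and fill it in.")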

Additional Requirements

  1. Install Ollama:

    Follow the instructions at https://ollama.com/
  2. Pull the LLaVA model that fits your available VRAM:

    ollama pull <model name>
    • llava 1.6 34b: ~20 GB VRAM
    • llava 1.6 13b: ~8 GB VRAM

    More LLaVA 1.6 models are listed in the Ollama library: https://ollama.com/library/llava

After installing Ollama and pulling a model, update the ollama_model variable in server.py to match the tag you pulled.
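
For example, if you pulled the 13B tag, a quick standalone test of the model against Ollama's local REST API could look like this; the exact model tag, and how server.py uses the variable, are assumptions:

    # Standalone test of a pulled model via Ollama's local REST API.
    # The model tag below is an assumption -- use the tag you actually pulled.
    import base64

    import requests

    ollama_model = "llava:13b"  # must match the output of `ollama list`

    with open("image.png", "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": ollama_model,
            "prompt": "Describe this image.",
            "images": [image_b64],
            "stream": False,  # return one JSON object instead of a stream
        },
    )
    print(resp.json()["response"])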

Files

  • server.py: Starts the server.
  • notebook.ipynb: Jupyter notebook for image captioning.
  • requirements.txt: Dependencies list.
  • sample.env: Environment variables template.