a tiny vision language model that kicks ass and runs anywhere
Website | Hugging Face | Demo
moondream2 is a 1.86B parameter model initialized with weights from SigLIP and Phi 1.5.
Model | VQAv2 | GQA | TextVQA | POPE | TallyQA |
---|---|---|---|---|---|
moondream1 | 74.7 | 57.9 | 35.6 | - | - |
moondream2 (latest) | 74.2 | 58.5 | 36.4 | (coming soon) | (coming soon) |
Using transformers (recommended)
pip install transformers timm einops
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image
model_id = "vikhyatk/moondream2"
revision = "2024-03-05"
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, revision=revision
)
tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
image = Image.open('<IMAGE_PATH>')
enc_image = model.encode_image(image)
print(model.answer_question(enc_image, "Describe this image.", tokenizer))
The model is updated regularly, so we recommend pinning the model version to a specific release as shown above.
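Because the snippet above encodes the image separately from answering, the encoded image can presumably be reused for several questions about the same picture. The helper below is a minimal sketch of that pattern, not part of the repository: `ask_questions` is a hypothetical name, and it relies only on the `encode_image` and `answer_question` calls shown above.

```python
from typing import Any, Dict, List


def ask_questions(model: Any, tokenizer: Any, image: Any, questions: List[str]) -> Dict[str, str]:
    """Encode the image once, then answer each question against that encoding.

    Hypothetical helper: assumes `model` exposes encode_image() and
    answer_question() exactly as in the snippet above.
    """
    enc_image = model.encode_image(image)  # image encoding runs a single time
    answers: Dict[str, str] = {}
    for question in questions:
        # Reuse the same encoded image for every question.
        answers[question] = model.answer_question(enc_image, question, tokenizer)
    return answers
```

With `model`, `tokenizer`, and `image` loaded as above, a call such as `ask_questions(model, tokenizer, image, ["Describe this image.", "What colors stand out?"])` returns a dictionary that maps each question to its answer.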
Using this repository
Clone this repository and install dependencies.
pip install -r requirements.txt
sample.py provides a CLI for running the model. When the --prompt argument is not provided, the script lets you ask questions interactively.
python sample.py --image [IMAGE_PATH] --prompt [PROMPT]
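For readers who want to build a similar script, the sketch below shows one way a one-shot or interactive CLI could be structured with `argparse`. It is an illustration under stated assumptions, not the contents of sample.py: the `run` function, the `answer` callback, and the choice to make `--image` required are all hypothetical.

```python
import argparse
from typing import Callable, List, Optional, Sequence


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Ask questions about an image.")
    # Assumption: the image path is mandatory in this sketch.
    parser.add_argument("--image", required=True, help="path to the input image")
    parser.add_argument("--prompt", default=None,
                        help="question to ask; omit to enter interactive mode")
    return parser


def run(argv: Optional[Sequence[str]],
        answer: Callable[[str, str], str],
        read_line: Callable[[str], str] = input) -> List[str]:
    """Answer a single prompt, or loop over typed questions when none is given.

    `answer(image_path, question)` is a hypothetical callback that wraps the model.
    """
    args = build_parser().parse_args(argv)
    if args.prompt is not None:
        # One-shot mode: --prompt was supplied on the command line.
        result = answer(args.image, args.prompt)
        print(result)
        return [result]

    # Interactive mode: keep asking until an empty line or end of input.
    results: List[str] = []
    while True:
        try:
            question = read_line("> ")
        except EOFError:
            break
        if not question.strip():
            break
        result = answer(args.image, question)
        print(result)
        results.append(result)
    return results
```

Passing `answer` and `read_line` as parameters keeps the argument handling separate from the model, so the control flow can be exercised without loading any weights.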
Use the gradio_demo.py script to start a Gradio interface for the model.
python gradio_demo.py
webcam_gradio_demo.py provides a Gradio interface that uses your webcam as input and performs inference in real time.
python webcam_gradio_demo.py
Limitations
- The model may generate inaccurate statements and struggle to understand intricate or nuanced instructions.
- The model may not be free from societal biases. Users should be aware of this and exercise caution and critical thinking when using the model.
- The model may generate offensive, inappropriate, or hurtful content if it is prompted to do so.