An example of LLaVA: Large Language and Vision Assistant[^1] in MLX. LLaVA is a multimodal model that generates text from combined image and text inputs.
Install the dependencies:

```shell
pip install -r requirements.txt
```
You can use LLaVA to ask questions about images.
For example, using the command line:

```shell
python generate.py \
  --model llava-hf/llava-1.5-7b-hf \
  --image "http://images.cocodataset.org/val2017/000000039769.jpg" \
  --prompt "USER: <image>\nWhat are these?\nASSISTANT:" \
  --max-tokens 128 \
  --temp 0
```
This downloads the image at the given URL and generates the output:

```
These are two cats lying on a pink couch.
```
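
If `generate.py` also resolves local file paths for `--image` (an assumption here; check the image-loading code in the script if unsure), the same command works with an image on disk, for example:

```shell
python generate.py \
  --model llava-hf/llava-1.5-7b-hf \
  --image path/to/local_image.jpg \
  --prompt "USER: <image>\nDescribe this image.\nASSISTANT:" \
  --max-tokens 128 \
  --temp 0
```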
You can also use LLaVA in Python:

```python
from generate import load_model, prepare_inputs, generate_text

# Load the processor and model weights from the Hugging Face Hub.
processor, model = load_model("llava-hf/llava-1.5-7b-hf")

max_tokens, temperature = 128, 0.0

prompt = "USER: <image>\nWhat are these?\nASSISTANT:"
image = "http://images.cocodataset.org/val2017/000000039769.jpg"

# Preprocess the prompt and image into model inputs.
input_ids, pixel_values = prepare_inputs(processor, image, prompt)

reply = generate_text(
    input_ids, pixel_values, model, processor, max_tokens, temperature
)

print(reply)
```
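
Loading the model is typically the slow step, so you can keep the returned `processor` and `model` around and reuse them across prompts. A minimal sketch, using only the functions and variables from the snippet above:

```python
# Reuse the already loaded model and processor for several questions
# about the same image.
questions = [
    "USER: <image>\nHow many cats are there?\nASSISTANT:",
    "USER: <image>\nWhat color is the couch?\nASSISTANT:",
]
for prompt in questions:
    input_ids, pixel_values = prepare_inputs(processor, image, prompt)
    print(generate_text(input_ids, pixel_values, model, processor, max_tokens, temperature))
```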
[^1]: Refer to the LLaVA project webpage for more information.