LLaVA inference combining multiple images into one for streamlined processing and cross-image analysis.
- You should follow the LLaVA tutorial so that you have the pretrained model / checkpoint shards ready.
- Then, `cd` into your LLaVA root directory.
- Clone my repo (and optionally remove the test-images):
```bash
git clone https://github.com/mapluisch/LLaVA-CLI-with-multiple-images.git && \
(cd LLaVA-CLI-with-multiple-images && \
rm -rf test-images && \
cp -a . ../) && \
rm -rf LLaVA-CLI-with-multiple-images
```
This command simply clones the repo, removes the test-images folder, copies all the files into the actual working directory (your LLaVA root directory), and finally removes the repo's directory.
While in your LLaVA directory, first activate the conda environment via `conda activate llava`.
Then, simply call my script via `python` or `python3` with your preferred arguments.
```bash
python llava-multi-images.py [ARGS]
```
Given that this project is based on LLaVA's `cli.py`, the following base arguments can be specified:
- `--model-path`, default=`"liuhaotian/llava-v1.6-vicuna-13b"`
- `--model-base`, default=`None`
- `--device`, default=`"cuda"`
- `--conv-mode`, default=`None`
- `--temperature`, default=`0.2`
- `--max-new-tokens`, default=`512`
- `--load-8bit`, action=`"store_true"`
- `--load-4bit`, action=`"store_true"`
- `--debug`, action=`"store_true"`
Additionally added args:
- `--images`
- `--save-image`, action=`"store_true"`
- `--concat-strategy`, default=`"vertical"`, choices=`["vertical", "horizontal", "grid"]`
- `--dist-images`, default=`20`
- `--grid-resolution`, default=`2560,1440`
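For reference, the added options could be registered on top of `cli.py`'s existing argument parser roughly like this (a sketch based on the defaults listed above; the `nargs`, types, and help texts are my assumptions, not necessarily the script's exact code):

```python
import argparse

parser = argparse.ArgumentParser()
# ... LLaVA's base arguments (--model-path, --temperature, etc.) would be defined here ...

# Additional multi-image options (names/defaults from the list above; the rest is assumed)
parser.add_argument("--images", nargs="+", required=True,
                    help="paths of the images to combine for inference")
parser.add_argument("--save-image", action="store_true",
                    help="store the concatenated image as concat-image.jpg")
parser.add_argument("--concat-strategy", default="vertical",
                    choices=["vertical", "horizontal", "grid"],
                    help="arrangement of the input images")
parser.add_argument("--dist-images", type=int, default=20,
                    help="spacing between images")
parser.add_argument("--grid-resolution", default=(2560, 1440),
                    type=lambda s: tuple(int(v) for v in s.split(",")),
                    help="output resolution for the grid strategy, e.g. 2560,1440")

args = parser.parse_args()
```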
Using `--images /some/img1.jpg /some/img2.jpg /some/img_n.jpg`, or alternatively `--images /some/img{1-n}.jpg` if the images share the same location and prefix, you can specify as many images as you want for inference. These input images get concatenated into a single image using PIL.
Using `--save-image`, the resulting concatenated image gets stored in the LLaVA directory as `concat-image.jpg`.
Using `--concat-strategy`, you can specify the arrangement of the concatenated images (see Examples).
Using `--dist-images`, you can specify the spacing between images.
Using `--grid-resolution`, you can specify the output image's resolution when using grid placement.
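To make the concatenation behavior concrete, here is a minimal PIL sketch of the vertical and grid strategies (illustrative only; the function names and exact placement logic are my assumptions, not the script's actual code):

```python
import math
from PIL import Image

def concat_vertical(paths, dist=20):
    """Stack images top-to-bottom with `dist` pixels of spacing (sketch)."""
    images = [Image.open(p) for p in paths]
    width = max(img.width for img in images)
    height = sum(img.height for img in images) + dist * (len(images) - 1)
    canvas = Image.new("RGB", (width, height), "white")
    y = 0
    for img in images:
        canvas.paste(img, (0, y))  # left-aligned; narrower images leave empty space
        y += img.height + dist
    return canvas

def concat_grid(paths, resolution=(2560, 1440), dist=20):
    """Tile images into a roughly square grid on a fixed-resolution canvas (sketch)."""
    images = [Image.open(p) for p in paths]
    cols = math.ceil(math.sqrt(len(images)))
    rows = math.ceil(len(images) / cols)
    cell_w = (resolution[0] - dist * (cols - 1)) // cols
    cell_h = (resolution[1] - dist * (rows - 1)) // rows
    canvas = Image.new("RGB", resolution, "white")
    for i, img in enumerate(images):
        img.thumbnail((cell_w, cell_h))  # downscale in place, keeping aspect ratio
        row, col = divmod(i, cols)
        canvas.paste(img, (col * (cell_w + dist), row * (cell_h + dist)))
    return canvas

# e.g., mirroring the first example below: two images, 250 px apart, stacked vertically
concat_vertical(["test-images/img1.jpg", "test-images/img2.jpg"], dist=250).save("concat-image.jpg")
```

Mismatched input resolutions are what produce the empty space visible in the example outputs below.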
For these two input images, `img1.jpg` & `img2.jpg`:
```bash
python3 llava-multi-images.py --load-4bit --save-image --images test-images/img1.jpg test-images/img2.jpg --dist-images 250 --concat-strategy vertical
```
results in this concatenated image being used for LLaVA's CLI conversation (empty space in the lower right due to resolution mismatch):
LLaVA can now answer questions about the merged image, understanding that it consists of separate images:
USER: Analyze and explain which image is the most visually intriguing.
ASSISTANT: Both images are visually striking, but they present different types of scenes and evoke different emotions.
The top image captures a dramatic and serene landscape. It features a snow-covered mountain under a sky with the Aurora Borealis, also known as the Northern Lights. The aurora creates a vibrant, almost otherworldly glow that contrasts with the cool, white snow on the mountain. This image evokes a sense of tranquility and awe at the natural beauty of the scene.
The bottom image shows a solitary tree in a field with dry grass. The tree stands out against the flat landscape, and the dry grass suggests a season like autumn or early spring. This image conveys a sense of solitude and the quiet beauty of nature.
Both images are visually intriguing, but they appeal to different aspects of the viewer's experience. The top image is more dramatic and otherworldly, while the bottom image is more grounded and evokes a sense of solitude. The choice of which image is more visually intriguing would depend on personal preference and the emotions one wishes to evoke.
Given these two input images, `img3.jpg` & `img4.jpg`:
```bash
python3 llava-multi-images.py --load-4bit --save-image --images test-images/img3.jpg test-images/img4.jpg --dist-images 250 --concat-strategy horizontal
```
LLaVA receives this image (again, empty space in the lower right due to resolution mismatch):
...and can once again answer promptly:
USER: Describe the main parts of each image.
ASSISTANT: The image on the left shows a building with a sign that reads "HOTEL" prominently displayed. The building has a distinctive architectural style with a tower-like structure and a series of windows. The sky is overcast, and the lighting suggests it could be early morning or late afternoon.
The image on the right depicts a road lined with trees on both sides, creating a tunnel-like effect. The road appears to be a highway or a major road, given the presence of a guardrail and the absence of any buildings or structures close to the road. The trees are lush and green, indicating that the photo was likely taken during a time of year when trees are in full leaf. The sky is clear and blue, suggesting a bright and sunny day.
Finally, given e.g. nine input images, in this case `img{5-13}.jpg`:
```bash
python3 llava-multi-images.py --load-4bit --save-image --images test-images/img{5-13}.jpg --dist-images 150 --concat-strategy grid --temperature 0.7
```
LLaVA receives this new image, sized according to `--grid-resolution` (which defaults to 2560x1440) and consisting of a grid of images:
Again, you can now pose questions to LLaVA, e.g., whether it can tell us a short rhyme about the different cars:
USER: Write a rhyme about how many different cars you can see.
ASSISTANT: In this collection, you'll see quite a few,
A variety of wheels, with styles to pursue.
Nine in total, each unique and grand,
Ranging from sports to luxury and beyond.

Some are sleek, some are chunky, some are fast,
All with designs that'll leave you amazed.
Brand names that shine like stars at night,
Mercedes, Audi, and BMW, they all take flight.

They're parked and zooming, on dirt and on tar,
Each model a masterpiece, a work of art.
From city streets to mountain paths,
These cars are ready to conquer every stretch.

So here's to the beauty and the power,
To the speed and style, and every hour.
For each model, a story, each one's a dream,
In this collage, cars are the stars of the scene.
This project is a prototype and serves as a basic example of using LLaVA CLI inference with multiple images at once. I have not tested it extensively; I've tried both LLaVA v1.5 and v1.6 13B with 4-bit quantization. Results may, and probably will, vary depending on the model and quantization you choose. Feel free to create a PR.