Image Input

Question

Opened this issue 8 months ago · 3 comments

Describe the feature

Hello.
llamafile appears to support image input in formats such as JPG, PNG, GIF, and BMP.

Example:
llamafile -ngl 9999 --temp 0 \
  --image ~/Pictures/lemurs.jpg \
  -m llava-v1.5-7b-Q4_K.gguf \
  --mmproj llava-v1.5-7b-mmproj-Q4_0.gguf \
  -e -p '### User: What do you see?\n### Assistant: ' \
  --no-display-prompt 2>/dev/null

Is it possible to implement this feature in the future?
Or is there some problem that makes it impossible?

Answer 1 · 2024-03-29T08:01:18.000Z

Hi, thanks for the request!
That should be feasible.
How would you like to use it, or see it exposed, inside Unity?

Answer 2 · 2024-03-29T08:33:17.000Z

Eventually, I would like to add the ability to describe to the user what the NPC character's camera (eyes) sees.

I haven't tested this against the local vision model yet, so I don't know to what extent it's possible, but it would be interesting if it were!
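
To make the idea concrete, here is a minimal sketch of how a saved camera frame could be passed to llamafile from a script. It assumes the frame has already been written to disk as an image file, that `llamafile` is on the PATH, and that the model files named in the example from the question are in the working directory. The function names `build_llamafile_args` and `describe_image` are hypothetical, and this is untested against a real vision model.

```python
import subprocess
from pathlib import Path


def build_llamafile_args(image_path, prompt="What do you see?"):
    """Build the argument list for the llamafile example in the question.

    The flags and model file names are copied from that example; nothing
    here is specific to Unity.
    """
    return [
        "llamafile",
        "-ngl", "9999",
        "--temp", "0",
        "--image", str(Path(image_path).expanduser()),
        "-m", "llava-v1.5-7b-Q4_K.gguf",
        "--mmproj", "llava-v1.5-7b-mmproj-Q4_0.gguf",
        # -e asks llamafile to process escapes such as \n in the prompt,
        # so the prompt carries a literal backslash-n, as in the shell example.
        "-e", "-p", f"### User: {prompt}\\n### Assistant: ",
        "--no-display-prompt",
    ]


def describe_image(image_path, prompt="What do you see?"):
    """Run llamafile on one image and return whatever it prints to stdout."""
    result = subprocess.run(
        build_llamafile_args(image_path, prompt),
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # same effect as 2>/dev/null
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Calling `describe_image("npc_frame.png")` (a hypothetical file name) would then return the model's description as a string, which the game could show to the user.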

Answer 3 · 2024-09-04T17:07:51.000Z

I implemented most of the functionality in this branch: feature/multimodal_models
Afterwards I found out that multimodal support was dropped from the llama.cpp server and has not been brought back in recent months: ggerganov/llama.cpp#8010 😞