Make mlx-vlm examples in Swift
davidkoski opened this issue · 21 comments
Consider porting some models from https://github.com/Blaizzy/mlx-vlm to Swift
e.g.
- LLaVA: llava-hf/LLaVA-NeXT-Video-7B-hf
- Qwen2 VL: Qwen/Qwen2-VL-2B-Instruct
- Llama 3.2 Vision: meta-llama/Llama-3.2-11B-Vision-Instruct
- Phi-3 Vision: microsoft/Phi-3-vision-128k-instruct
- PaliGemma: google/paligemma-3b-mix-224
Currently, I am working on porting Llama 3.2 VLM to Swift. It would be great if we could make the VLM support a separate package so that people can easily pull it down as a dependency and integrate it into their applications, for example, to add VLM support to ChatMLX.
If someone can put together the basic pipeline for one vision model, I can probably port the others to Swift fairly quickly.
I am working on it right now and have paligemma done (well, not debugged but callable). I am working on how to structure the code with regard to the LLM library -- they should share code where possible.
I will try to put up the branch with what I have today. Next week will be busy, so it might be two weeks from now before it is really ready.
Fantastic, thank you! Once that's in place, I'll start working on some of the other models (and will post here first to avoid duplication of work).
OK, you can see what I have -- more work to be done but the eval loop is worked out.
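To make "the eval loop" concrete, here is a minimal, illustrative sketch of the generation loop a VLM runner needs. The model interface (`prefill`/`step`) is hypothetical, not the actual mlx-swift-examples API: the point is only that the image features and prompt tokens are consumed once in a prefill step, after which decoding is the same token-by-token loop an LLM uses.

```python
# Hypothetical sketch of a VLM eval loop (NOT the mlx-swift-examples API).
# `prefill` consumes the multimodal prefix (image features + prompt tokens)
# and returns the first predicted token plus an opaque KV cache; `step`
# advances one token at a time using that cache.

def generate(prefill, step, image_features, prompt_tokens,
             eos_token=0, max_tokens=16):
    """Greedy decode until EOS or the token budget is exhausted."""
    token, cache = prefill(image_features, prompt_tokens)
    output = []
    while token != eos_token and len(output) < max_tokens:
        output.append(token)
        token, cache = step(token, cache)
    return output

# Toy stand-ins so the sketch runs: this "model" just counts down from the
# last prompt token until it reaches the EOS token 0.
def toy_prefill(image_features, prompt_tokens):
    return prompt_tokens[-1] - 1, None

def toy_step(token, cache):
    return token - 1, cache
```

With the toy model, `generate(toy_prefill, toy_step, None, [5, 4])` emits the countdown `[3, 2, 1]` and stops at EOS.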
This continues -- I have most of the refactoring done, and `llm-tool` has a hard-coded call to `paligemma`. I need to implement a second VLM (`qwen2_vl`) so I can make sure I have the right shape for the APIs.
As mentioned before, this will be a breaking change in the API (so I will do a major version bump), but it should be pretty easy to adopt -- hopefully just a new import and renaming a couple of things. I will produce a guide when it is ready.
Thanks @davidkoski, your work is much appreciated! Once the API is stable, I'll try to port some of the other VLMs.
@davidkoski @DePasqualeOrg did either of you get Qwen2 VL working in Swift?
It is implemented in the branch right now but still lacks the image processor -- that is what I am starting on next.
You are doing god's work @davidkoski! If you need help, let me know! Also, do you know what would be necessary to go from image processing to video processing?
@davidkoski Blaizzy/mlx-vlm#97 here is a PR from mlx-vlm that might help!
Yes, this first version won't have it, but it should be straightforward to add. Qwen2VL treats an array of images and a video roughly the same but handles them slightly differently in the processor. The video ends up with a different value in the `t` (temporal?) component when it constructs the `thw` array.
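A rough sketch of how that `(t, h, w)` grid could be derived, to illustrate the image-vs-video difference. This is a hedged approximation, not the actual mlx-vlm or Qwen2VL processor code; the constants (spatial patch size 14, temporal patch size 2) and the pad-up behavior are assumptions based on Qwen2-VL-style processors, where a single image collapses to `t = 1` while a video's frame count determines `t`.

```python
# Hypothetical sketch of Qwen2-VL-style (t, h, w) grid computation.
# Assumed constants -- verify against the real processor config:
PATCH_SIZE = 14           # spatial patch edge, in pixels
TEMPORAL_PATCH_SIZE = 2   # frames grouped into one temporal patch

def grid_thw(height: int, width: int, num_frames: int = 1) -> tuple[int, int, int]:
    """Return the (t, h, w) patch grid for an image (num_frames=1) or video.

    A single image is treated like a one-frame video: frames are padded up
    to a multiple of TEMPORAL_PATCH_SIZE, so t collapses to 1 for images,
    while a real video gets t = frames / TEMPORAL_PATCH_SIZE.
    """
    # Ceiling-pad the frame count so it divides evenly into temporal patches.
    padded = -(-num_frames // TEMPORAL_PATCH_SIZE) * TEMPORAL_PATCH_SIZE
    t = padded // TEMPORAL_PATCH_SIZE
    h = height // PATCH_SIZE
    w = width // PATCH_SIZE
    return (t, h, w)
```

Under these assumptions, a 224x224 image yields `(1, 16, 16)` while a 16-frame clip at the same resolution yields `(8, 16, 16)` -- the same spatial grid, differing only in `t`, which matches the comment above about images and videos being handled almost identically.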