Blaizzy/mlx-vlm

Models to port to MLX-VLM

Blaizzy opened this issue · 20 comments

  • MiniCPM-Llama3-V-2_5
  • Florence 2
  • Phi-3-vision
  • Bunny
  • Dolphin-vision-72b
  • Llava Next
  • Qwen2-VL
  • Pixtral
  • Idefics 3
  • Llava Interleave
  • Llava OneVision
  • internlm-xcomposer2d5-7b
  • InternVL
  • CogVLM2
  • ColPali
  • MoonDream2
  • Yi-VL
  • CuMo
  • Kosmos-2.5
  • Molmo
  • Llama-3.2
  • Ovis Gemma
  • Aria
  • NVIDIA NVLM
  • GOT

Instructions:

  1. Select the model and comment below with your selection
  2. Create a Draft PR titled: "Add support for X"
  3. Read the Contribution guide
  4. Check the existing models
  5. Tag @Blaizzy for code reviews and questions.

If the model you want is not listed, please suggest it and I will add it.

For the next release of Llava-Next:

TODO: update the text config defaults to avoid errors with Llava-v1.6-vicuna:

from dataclasses import dataclass
from typing import Dict, Optional, Union

@dataclass
class TextConfig:
    model_type: str
    hidden_size: int = 4096
    num_hidden_layers: int = 32
    intermediate_size: int = 11008
    num_attention_heads: int = 32
    rms_norm_eps: float = 1e-05
    vocab_size: int = 32064
    num_key_value_heads: int = 32
    rope_theta: float = 1000000
    rope_traditional: bool = False
    rope_scaling: Optional[Dict[str, Union[float, str]]] = None
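
To illustrate why the defaults matter: a config.json that omits most of these keys can still be loaded. The helper below is a hypothetical sketch, not the repo's code; it drops keys the dataclass doesn't declare and lets the defaults fill the gaps:

import inspect

def text_config_from_dict(params: dict) -> TextConfig:
    # Hypothetical helper: keep only keys TextConfig declares, so a
    # HF config.json with extra fields doesn't raise a TypeError.
    allowed = inspect.signature(TextConfig).parameters
    return TextConfig(**{k: v for k, v in params.items() if k in allowed})

# A Llava-v1.6-vicuna text config may omit fields; the defaults cover them.
cfg = text_config_from_dict({"model_type": "llama", "vocab_size": 32000})
print(cfg.rms_norm_eps)  # 1e-05, from the default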

Thanks for the great repo. This should also be on the list: https://github.com/THUDM/CogVLM2
I am just reading the code now and trying to free up some time for the conversion routine.

Hey @BoltzmannEntropy and @jrp2014,

Thanks for the suggestions!

I have added them to the backlog

MiniCPM-V v2.6

Do you have a link to Florence-2?

Is the above list the ultimate and up-to-date list of supported models @Blaizzy? Thanks for your hard work!

Hey @ChristianWeyer
It's mostly up-to-date, just missing Qwen2-VL.

[x] Phi-3-vision

Thanks!
I guess Phi-3-vision includes 3.5?

Yes, they share the same architecture, so no changes are needed :)
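
For background, model loading in this kind of codebase typically dispatches on the config's model_type, so two checkpoints that report the same model_type automatically share one implementation. A minimal sketch of that dispatch (illustrative only; the module path is an assumption, not the repo's exact loader code):

import importlib

def get_model_module(model_type: str):
    # e.g. a model_type of "phi3_v" resolves to one implementation module
    # for both the Phi-3-vision and Phi-3.5-vision checkpoints.
    return importlib.import_module(f"mlx_vlm.models.{model_type}")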

Hey @Blaizzy, thanks for this great framework. Is there any priority for InternVL? I can see it is on your list; I just wanted to know if it is planned for the near term. I want to run the model on my MacBook, and mlx-vlm looks to be the best way to do that.

Qwen2-VL-72B would be amazing!

This recipe seems to work for Qwen2-VL-2B-Instruct:

python -m mlx_vlm.generate \
  --model Qwen/Qwen2-VL-2B-Instruct \
  --max-tokens 100 \
  --temp 0.0 \
  --image django-roadmap.png \
  --prompt "Describe image in detail, include all text"

My results here: https://gist.github.com/simonw/9e02d425cacb902260ec1307e0671e17
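
For comparison, the same run can be scripted. This is a minimal sketch assuming mlx_vlm's Python load/generate helpers; the argument names vary between releases, so treat them as assumptions and check the repo's README:

# Sketch: Python equivalent of the CLI recipe above.
# Assumes mlx_vlm exports load() and generate(); keyword names
# may differ across versions.
from mlx_vlm import load, generate

model, processor = load("Qwen/Qwen2-VL-2B-Instruct")
output = generate(
    model,
    processor,
    image="django-roadmap.png",
    prompt="Describe image in detail, include all text",
    max_tokens=100,
    temp=0.0,
)
print(output)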

Yep, they just merged Qwen2-VL support this weekend.

Molmo please

NVIDIA just dropped the multimodal NVLM-D-72B. The benchmarks look pretty good.

https://huggingface.co/nvidia/NVLM-D-72B

Yep, that's a pretty awesome model!
It's on my radar because we can run it in 4-bit quantization.
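
For anyone who wants to try that, a 4-bit conversion might look like the sketch below. It assumes mlx_vlm ships a convert helper mirroring mlx_lm's; the import path and argument names here are assumptions, so check the repo's convert module before use:

# Sketch only: assumes mlx_vlm.convert mirrors mlx_lm's convert API.
from mlx_vlm.convert import convert  # assumed import path

convert(
    hf_path="nvidia/NVLM-D-72B",  # source weights on the Hub
    mlx_path="NVLM-D-72B-4bit",   # hypothetical local output directory
    quantize=True,                # 4 bits by default in the API this mirrors
)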