Models to port to MLX-VLM
Blaizzy opened this issue · 20 comments
- MiniCPM-Llama3-V-2_5
- Florence 2
- Phi-3-vision
- Bunny
- Dolphin-vision-72b
- Llava Next
- Qwen2-VL
- Pixtral
- Idefics 3
- Llava Interleave
- Llava onevision
- internlm-xcomposer2d5-7b
- InternVL
- CogVLM2
- ColPali
- MoonDream2
- Yi-VL
- CuMo
- Kosmos-2.5
- Molmo
- Llama-3.2
- Ovis Gemma
- Aria
- NVIDIA NVLM
- GOT
Instructions:
- Select the model and comment below with your selection
- Create a Draft PR titled: "Add support for X"
- Read Contribution guide
- Check existing models
- Tag @Blaizzy for code reviews and questions.
If the model you want is not listed, please suggest it and I will add it.
Next release of Llava-Next
TODO:
- Update the TextConfig defaults to avoid errors with Llava-v1.6-vicuna:
```python
from dataclasses import dataclass
from typing import Dict, Optional, Union


@dataclass
class TextConfig:
    model_type: str
    hidden_size: int = 4096
    num_hidden_layers: int = 32
    intermediate_size: int = 11008
    num_attention_heads: int = 32
    rms_norm_eps: float = 1e-05
    vocab_size: int = 32064
    num_key_value_heads: int = 32
    rope_theta: float = 1000000
    rope_traditional: bool = False
    rope_scaling: Optional[Dict[str, Union[float, str]]] = None
```
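As a quick illustration of why the defaults help, here is a minimal sketch (the partial config dict below is made up for illustration, not copied from the actual Llava-v1.6-vicuna config.json): with defaults in place, a checkpoint whose text config omits some keys can still construct a `TextConfig` without raising.

```python
import inspect

# Hypothetical partial text config, standing in for a config.json that
# omits fields such as rope_theta and vocab_size.
raw_config = {
    "model_type": "llama",
    "hidden_size": 4096,
    "num_hidden_layers": 32,
}

# Keep only the keys TextConfig declares, then let the defaults above fill in
# everything the checkpoint's config.json leaves out.
fields = inspect.signature(TextConfig).parameters
text_config = TextConfig(**{k: v for k, v in raw_config.items() if k in fields})

print(text_config.rope_theta)  # 1000000, supplied by the default, not the file
```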
Thanks for the great repo. This should also be on the list: https://github.com/THUDM/CogVLM2
I'm currently reading the code and trying to free up some time for the conversion routine.
MiniCPM-V v2.6
MiniCPM-V v2.6
Do you have a link to Florence-2?
Is the above list the ultimate and up-to-date list of supported models @Blaizzy? Thanks for your hard work!
Hey @ChristianWeyer,
It's mostly up-to-date, just missing Qwen2-VL.
[x] Phi-3-vision
Thanks!
I guess Phi-3-vision includes 3.5?
Yes, they have the same arch so there are no changes needed :)
Hey @Blaizzy, thanks for this great framework. Is there any priority for InternVL? I can see it is present in your list; I just wanted to know if it is planned in the near term. I want to run the model on my MacBook, and mlx-vlm looks like the best way to do that.
Qwen2-VL-72B would be amazing!
This recipe seems to work for Qwen2-VL-2B-Instruct:
```bash
python -m mlx_vlm.generate \
  --model Qwen/Qwen2-VL-2B-Instruct \
  --max-tokens 100 \
  --temp 0.0 \
  --image django-roadmap.png \
  --prompt "Describe image in detail, include all text"
```
My results here: https://gist.github.com/simonw/9e02d425cacb902260ec1307e0671e17
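For anyone who prefers the Python API over the CLI, here is a rough equivalent. It assumes the `load`/`generate` helpers shown in the mlx-vlm README; argument names and defaults have changed between versions, so treat this as a sketch rather than the canonical usage.

```python
from mlx_vlm import load, generate

# Assumed API: load() returns the model and its processor; generate() takes
# the model, processor, an image path or URL, and a prompt. Verify against
# the README of your installed mlx-vlm version.
model, processor = load("Qwen/Qwen2-VL-2B-Instruct")

output = generate(
    model,
    processor,
    "django-roadmap.png",
    "Describe image in detail, include all text",
    max_tokens=100,
    temp=0.0,
)
print(output)
```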
Yep, they just merged Qwen2-VL support this weekend.
Molmo please
NVIDIA just dropped the multimodal NVLM-D-72B. The benchmarks look pretty good.
Yep, that's a pretty awesome model!
It's on my radar because we can run it in a 4-bit quant.
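For context, 4-bit quantization in mlx-vlm happens at conversion time. The command below is a sketch modeled on the convert script that mlx-vlm shares in spirit with mlx-lm; the exact flags are an assumption, so check `python -m mlx_vlm.convert --help` before running it.

```bash
# Assumed flags, mirroring mlx-lm's convert script; verify with --help first.
python -m mlx_vlm.convert --hf-path nvidia/NVLM-D-72B -q
```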