unable to run paligemma
nischalj10 opened this issue · 8 comments
I am trying to run the following code but it is giving an error. Please assist!
import mlx.core as mx
from mlx_vlm import load, generate
model_path = "google/paligemma-3b-mix-448"
model, processor = load(model_path)
print(processor)
prompt = processor.tokenizer.apply_chat_template(
[{"role": "user", "content": f"<image>\nWhat are these?"}],
tokenize=False,
add_generation_prompt=True,
)
output = generate(model, processor, "http://images.cocodataset.org/val2017/000000039769.jpg", prompt, verbose=False)
Traceback (most recent call last):
File "/Users/namanjain/Desktop/repos/local-recall/models.py", line 15, in <module>
output = generate(model, processor, "http://images.cocodataset.org/val2017/000000039769.jpg", prompt, verbose=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/lib/python3.11/site-packages/mlx_vlm/utils.py", line 809, in generate
logits, cache = model(input_ids, pixel_values, mask)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/lib/python3.11/site-packages/mlx_vlm/models/paligemma/paligemma.py", line 139, in __call__
input_embeddings, final_attention_mask_4d = self.get_input_embeddings(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/anaconda3/lib/python3.11/site-packages/mlx_vlm/models/paligemma/paligemma.py", line 82, in get_input_embeddings
self._prepare_inputs_for_multimodal(
File "/opt/anaconda3/lib/python3.11/site-packages/mlx_vlm/models/paligemma/paligemma.py", line 115, in _prepare_inputs_for_multimodal
final_embedding[image_mask_expanded] = scaled_image_features.flatten()
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^
ValueError: NumPy boolean array indexing assignment cannot assign 2097152 input values to the 2099200 output values where the mask is true
Thanks for the quick reply. I updated the code as suggested.
import mlx.core as mx
from mlx_vlm import load, generate
model_path = "google/paligemma-3b-mix-448"
model, processor = load(model_path)
output = generate(model, processor, "/Users/namanjain/app-data/local-recall/screenshots/1717766288971.png", prompt="describe this screenshot")
print(output)
It takes forever to generate any output, which is not the case with much larger models on my M2 chip. Also, the max_tokens param doesn't seem to be configurable, and somehow the model generates very few tokens.
Could you share your setup specs?
also, the max_tokens param doesn't seem to be configurable, and somehow the model generates very few tokens.
It is configurable. By default it's set to 100, but you can increase it by passing the max_tokens argument to the generate function.
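For example, a minimal sketch (the image path and prompt here are placeholders):

from mlx_vlm import load, generate

model, processor = load("google/paligemma-3b-mix-448")

# max_tokens caps the generation length; the default is 100
output = generate(
    model,
    processor,
    "path/to/screenshot.png",  # placeholder image path
    prompt="describe this screenshot",
    max_tokens=500,
)
print(output)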
Here's my generate call:
output = generate(model, processor, "/Users/namanjain/app-data/local-recall/screenshots/1717766288971.png", prompt="elaborately describe this screenshot. what app or website url is this on?", max_tokens=500)
But the model's response is only one word.
My specs: M2 Air with 8 GB RAM. However, the GPU isn't fully utilised during inference, and there's enough capacity to run the model.
A few things to note about Paligemma:
- Paligemma is not a chat model. It takes simple, single-turn instructions and commands (e.g., "detect cat", "segment cat", "Describe this image", "What does the image show?"). Read more in the blog post linked below.
- You are running the full-precision model, which even on an M3 Max runs at about 5 tokens/s for the prompt and 25 tokens/s for generation. For faster inference, you can use the 8-bit quant available in the MLX-community repo: https://huggingface.co/mlx-community/paligemma-3b-mix-448-8bit.
- I would recommend the 224x224 model for your machine instead of the 448x448, because the higher the resolution, the more memory it needs to run: https://huggingface.co/mlx-community/paligemma-3b-mix-224-8bit (see the sketch after this list).
Recommended reading: https://huggingface.co/blog/paligemma
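Putting that together, here's a minimal sketch for your machine (the repo ID is the 224x224 8-bit quant linked above; the image path is a placeholder):

from mlx_vlm import load, generate

# 8-bit 224x224 quant: needs much less memory than the full-precision 448 model
model, processor = load("mlx-community/paligemma-3b-mix-224-8bit")

# Paligemma expects a simple single-turn instruction, not a chat template
output = generate(
    model,
    processor,
    "path/to/screenshot.png",  # placeholder image path
    prompt="Describe this image",
    max_tokens=200,
)
print(output)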
@Blaizzy were you able to get these models to output bounding boxes? If I use something like detect cat as the prompt for either of them (on a sample image with 2 cats), it either gives or 2 (which is right, but not the bounding box). I saw that elsewhere you were having issues adding the model to the library.
Segmentation seems to work better with lower temperatures.
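If you want to experiment with that, something like this should work (a minimal sketch; I'm assuming generate forwards a temp keyword as in mlx_lm, and the image path is a placeholder):

from mlx_vlm import load, generate

model, processor = load("mlx-community/paligemma-3b-mix-448-8bit")

# detection-style prompt; a low temperature keeps the output near-deterministic
output = generate(
    model,
    processor,
    "path/to/cats.jpg",  # placeholder image path
    prompt="detect cat",
    temp=0.1,  # assumed kwarg name, following the mlx_lm convention
    max_tokens=100,
)
print(output)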