
MLLMs Know Where to Look:
Training-free Perception of Small Visual Details with Multimodal LLMs

Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, Filip Ilievski

Method Overview

arXiv · OpenReview · ICLR 2025

📋 Overview

This repository contains the official implementation of our ICLR 2025 paper "MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs". Our method enables multimodal large language models (MLLMs) to better perceive small visual details without any additional training. The repository provides detailed implementations for applying our method to multiple MLLMs and benchmark datasets.

🔥 Highlights

  • 🔍 We find that MLLMs often know where to look, even when their answers are wrong.
  • 📸 We propose a training-free method that significantly enhances MLLMs' visual perception of small visual details.
  • 💪 Our method is flexible across different visual input formats, including high-resolution images (see below), multiple images, and video (to be explored in the future).

🛠️ Installation

Setup Environment

# Create and activate conda environment
conda create -n mllms_know python=3.10
conda activate mllms_know

# Install dependencies
pip install -r requirements.txt

# Install modified transformers library
cd transformers
pip install -e .
cd ..

🚀 Quick Start

We provide a quick start notebook that demonstrates how to:

  • Load and process images
  • Apply our methods to enhance visual perception
  • Visualize attention maps
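
To get a feel for the pipeline outside the notebook, here is a minimal sketch (not the notebook's exact code) that loads LLaVA-1.5 through the standard Hugging Face interface and exposes the attention tensors that our cropping methods aggregate. The checkpoint name, image path, and prompt below are placeholders.

import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # placeholder checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg")  # placeholder image
prompt = "USER: <image>\nWhat is written on the small sign? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions holds one (batch, heads, seq, seq) tensor per layer over the
# full multimodal token sequence; the attention from text tokens to image tokens
# is what the cropping methods turn into a spatial importance map.
print(len(outputs.attentions), outputs.attentions[0].shape)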

📊 Benchmark Evaluation

Dataset Preparation

  1. Download the benchmark datasets and corresponding images to your local directory
  2. Update the paths in info.py with your local directory paths

Example (textvqa)

Dataset preparation:

mkdir -p data/textvqa/images
wget https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip -P data/textvqa/images
unzip data/textvqa/images/train_val_images.zip -d data/textvqa/images
rm data/textvqa/images/train_val_images.zip
mv data/textvqa/images/train_images/* data/textvqa/images
rm -r data/textvqa/images/train_images
wget https://dl.fbaipublicfiles.com/textvqa/data/TextVQA_0.5.1_val.json -P data/textvqa

Dataset processing (to a unified format):

import json

with open('data/textvqa/TextVQA_0.5.1_val.json') as f:
    datas = json.load(f)

new_datas = []
for data_id, data in enumerate(datas['data']):
    data_id = str(data_id).zfill(10)
    question = data['question']
    labels = data['answers']
    image_path = f"{data['image_id']}.jpg"
    new_data = {
        'id': data_id,
        'question': question,
        'labels': labels,
        'image_path': image_path
    }
    new_datas.append(new_data)

with open('data/textvqa/data.json', 'w') as f:
    json.dump(new_datas, f, indent=4)

Running Evaluations

To run our method on benchmark datasets, use the provided script:

# Format: bash run_all.sh [dataset] [model] [method]
bash run_all.sh textvqa llava rel_att

Get the model's performance:

python get_score.py --data_dir ./data/results --save_path ./

Models

  • LLaVA-1.5 (llava)
  • InstructBLIP (blip)

For implementation details, see llava_methods.py and blip_methods.py. Please feel free to explore other MLLMs!

📝 Method Details

Our approach leverages inherent attention mechanisms and gradients in MLLMs to identify regions of interest without additional training. The key methods include:

  1. Relative Attention-based Visual Cropping: Computes a relative attention map $A_{\mathrm{rel}}(x, q)$ for each image-question pair to guide visual cropping; the target layer is selected once using TextVQA validation data (a simplified attention-map sketch follows this list).

  2. Gradient-Weighted Attention-based Visual Cropping: Uses gradient information to refine attention maps, normalizing answer-to-token and token-to-image attention without requiring a second forward pass.

  3. Input Gradient-based Visual Cropping: Directly computes the gradient of the model’s decision w.r.t. the input image. To mitigate noise in uniform regions, it applies Gaussian high-pass filtering, median filtering, and thresholding before spatial aggregation.
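
As a concrete illustration of the first method, the sketch below turns one layer's attention into a spatial importance map by averaging, over heads, the attention from the last text token to the image tokens and reshaping it onto the vision token grid. The layer index, grid size, and token positions are assumptions (LLaVA-1.5 uses 576 image tokens on a 24x24 grid), and the relative-attention normalization from the paper is omitted; see llava_methods.py for the exact computation.

import torch
import torch.nn.functional as F

def attention_importance_map(attentions, image_token_start, num_image_tokens=576,
                             layer=14, grid_size=24, image_size=336):
    # attentions: tuple of per-layer (batch, heads, seq, seq) tensors from a
    # forward pass with output_attentions=True.
    att = attentions[layer][0]  # (heads, seq, seq)
    # Attention from the last text token to every image token, averaged over heads.
    att_to_image = att[:, -1, image_token_start:image_token_start + num_image_tokens]
    grid = att_to_image.mean(dim=0).reshape(grid_size, grid_size)
    # Upsample the token grid to image resolution so it can drive cropping.
    grid = F.interpolate(grid[None, None].float(), size=(image_size, image_size),
                         mode="bilinear", align_corners=False)[0, 0]
    return grid / grid.max()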

Bounding Box Selection for Visual Cropping.
We use a sliding window approach to extract a bounding box from the importance map. Square windows at several sizes, scaled by factors in $\{1, 1.2, \dots, 2\}$, slide over the image with a stride of 1. For each window size, the position maximizing the sum of importance values is kept; among these candidates, the window whose score deviates most from its neighbors is chosen. The cropped region is then resized and fed into the MLLM.
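
Below is a simplified sketch of this search, using an integral image so every window sum costs O(1). The base window size, scales, stride, and the exact deviation score are assumptions that only approximate the selection rule in utils.py.

import numpy as np

def select_bbox(importance, base=112, scales=(1.0, 1.2, 1.4, 1.6, 1.8, 2.0), stride=1):
    # importance: 2D numpy array (H, W) produced by one of the methods above.
    H, W = importance.shape
    ii = np.pad(importance, ((1, 0), (1, 0))).cumsum(0).cumsum(1)  # integral image

    def window_sum(y, x, s):
        return ii[y + s, x + s] - ii[y, x + s] - ii[y + s, x] + ii[y, x]

    best = []  # best (score, (top, left, side)) for each scale
    for scale in scales:
        side = min(int(base * scale), H, W)
        candidates = [(window_sum(y, x, side), (y, x, side))
                      for y in range(0, H - side + 1, stride)
                      for x in range(0, W - side + 1, stride)]
        best.append(max(candidates))

    # Pick the scale whose (area-normalized) best score deviates most from the others.
    norm = np.array([score / (box[2] ** 2) for score, box in best])
    dev = [abs(norm[i] - np.delete(norm, i).mean()) for i in range(len(norm))]
    return best[int(np.argmax(dev))][1]  # (top, left, side) of the chosen crop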

High-Resolution Visual Cropping.
For high-resolution images ($>1K$), we first split them into smaller non-overlapping blocks ($<1024\times1024$), compute importance maps for each block, and merge them. The same bounding box selection is then applied to the merged importance map.
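
A sketch of the block-splitting step is below, assuming an importance_fn that maps a PIL image block to a 2D numpy importance map (for example, the attention sketch above); the block size is a placeholder below the 1024x1024 limit.

import numpy as np
from PIL import Image

def highres_importance(image, importance_fn, block=1008):
    # Split a high-resolution PIL image into non-overlapping blocks, compute a
    # per-block importance map, and stitch the maps into one full-size map.
    W, H = image.size
    merged = np.zeros((H, W), dtype=np.float32)
    for top in range(0, H, block):
        for left in range(0, W, block):
            bottom, right = min(top + block, H), min(left + block, W)
            imp = importance_fn(image.crop((left, top, right, bottom)))
            # Resize the block's map to the block's pixel size before pasting.
            imp_img = Image.fromarray(np.asarray(imp, dtype=np.float32)).resize(
                (right - left, bottom - top), Image.BILINEAR)
            merged[top:bottom, left:right] = np.asarray(imp_img)
    return merged  # bounding box selection then runs on this merged map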

For implementation details, see llava_methods.py, blip_methods.py, and utils.py.

📊 Results

Our method significantly improves MLLMs' performance on tasks requiring perception of small visual details, such as text recognition in images, fine-grained object recognition, and spatial reasoning. Please refer to the paper for more details, and run the demo notebook to see the method in action!

📚 Citation

If you find our paper and code useful for your research and applications, please cite using this BibTeX:

@article{zhang2025mllms,
  title={MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs},
  author={Zhang, Jiarui and Khayatkhoei, Mahyar and Chhikara, Prateek and Ilievski, Filip},
  journal={arXiv preprint arXiv:2502.17422},
  year={2025}
}

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.