
Code for paper: An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models


Please visit https://github.com/pkunlp-icler/FastV for the latest updates.

FastV


FastV is a plug-and-play inference acceleration method for large vision-language models that rely on visual tokens. It can reach a 45% theoretical FLOPs reduction without harming performance by pruning redundant visual tokens in deep layers.
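For intuition, here is a minimal sketch of the idea, not the repository's implementation: after a chosen layer K, rank the image tokens by the attention they receive and keep only the top 1 − R fraction for all later layers. Function and variable names below are illustrative.

import torch

def prune_visual_tokens(hidden_states, attn_weights, visual_range, keep_ratio):
    """Illustrative FastV-style pruning sketch (not the repo's code).

    hidden_states: (batch, seq_len, dim) activations entering layer K+1
    attn_weights:  (batch, heads, seq_len, seq_len) attention from layer K
    visual_range:  (start, end) indices of the image tokens in the sequence
    keep_ratio:    fraction of visual tokens to keep, i.e. 1 - R
    """
    start, end = visual_range
    # Average attention each visual token receives, over heads and query positions.
    received = attn_weights.mean(dim=1).mean(dim=1)[:, start:end]  # (batch, n_visual)
    n_keep = max(1, int((end - start) * keep_ratio))
    # Keep the most-attended visual tokens, preserving their original order.
    top = received.topk(n_keep, dim=-1).indices.sort(dim=-1).values + start
    batch_idx = torch.arange(hidden_states.size(0), device=hidden_states.device).unsqueeze(-1)
    kept_visual = hidden_states[batch_idx, top]  # (batch, n_keep, dim)
    # Re-assemble the sequence: prefix tokens + surviving visual tokens + text tokens.
    return torch.cat(
        [hidden_states[:, :start], kept_visual, hidden_states[:, end:]], dim=1
    )

Because the pruning happens only once, at layer K, all deeper layers operate on a shorter sequence, which is where the FLOPs saving comes from.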


Latest News 🔥

  • LVLM Inefficient Visual Attention Visualization Code
  • FastV demo inference code
  • Gradio visualization demo
  • Integrate FastV into LLM inference frameworks

Stay tuned!

Setup

conda create -n fastv python=3.10
conda activate fastv
cd src
bash setup.sh

Visualization: Inefficient Attention over Visual Tokens

We provide a script (./src/FastV/inference/visualization.sh) to reproduce the attention visualization for each layer of a LLaVA model on a given image and prompt.

bash ./src/FastV/inference/visualization.sh

or

python ./src/FastV/inference/plot_inefficient_attention.py \
    --model-path "PATH-to-HF-LLaVA1.5-Checkpoints" \
    --image-path "./src/LLaVA/images/llava_logo.png" \
    --prompt "Describe the image in detail." \
    --output-path "./output_example"

The model output and the per-layer attention maps will be stored in "./output_example".
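For reference, per-layer attention maps like these can be obtained from Hugging Face transformers by passing output_attentions=True. A minimal sketch, assuming the llava-hf/llava-1.5-7b-hf checkpoint as an illustrative stand-in for your local LLaVA-1.5 weights:

import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Illustrative model id; substitute the path to your own checkpoint.
model_id = "llava-hf/llava-1.5-7b-hf"
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="eager",  # eager attention returns full attention weights
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("./src/LLaVA/images/llava_logo.png")
prompt = "USER: <image>\nDescribe the image in detail. ASSISTANT:"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions holds one (batch, heads, seq, seq) tensor per layer;
# averaging over heads gives the per-layer map that gets plotted.
layer_maps = [a.mean(dim=1)[0] for a in out.attentions]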

FastV demo inference code

We provide code to reproduce the ablation study on the K and R values, as shown in Figure 7 of the paper. For convenience, this implementation masks out the discarded tokens in deep layers instead of removing them; in-place token dropping will be added as part of the LLM inference framework integration.

OCR-VQA

bash ./src/FastV/inference/eval/eval_ocrvqa.sh
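Conceptually, the masking variant keeps the sequence length unchanged and simply hides the dropped visual tokens from attention in layers deeper than K. A hedged sketch of building such a mask (names are illustrative; the ranking score mirrors the pruning sketch above):

import torch

def fastv_key_mask(received_attn, visual_range, seq_len, R):
    """Build a key mask hiding the R least-attended visual tokens.

    received_attn: (batch, n_visual) attention each visual token received at layer K
    Returns a boolean mask of shape (batch, seq_len); False = masked out.
    Illustrative sketch of the masking ablation, not the repo's code.
    """
    start, end = visual_range
    n_drop = int((end - start) * R)
    mask = torch.ones(received_attn.size(0), seq_len, dtype=torch.bool)
    if n_drop > 0:
        # Indices of the least-attended visual tokens, shifted to sequence positions.
        drop = received_attn.topk(n_drop, dim=-1, largest=False).indices + start
        mask.scatter_(1, drop, False)
    return mask  # apply to the attention of all layers deeper than K

Masking leaves the compute unchanged (the tokens are still there, just invisible), so it measures the accuracy side of the K/R trade-off, while the FLOPs saving requires the in-place dropping mentioned above.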

Results

Citation

@misc{chen2024image,
      title={An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models}, 
      author={Liang Chen and Haozhe Zhao and Tianyu Liu and Shuai Bai and Junyang Lin and Chang Zhou and Baobao Chang},
      year={2024},
      eprint={2403.06764},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}