pkunlp-icler/FastV

I still don't know why this method can reduce FLOPs after reading the code

Closed this issue · 1 comment

I've read the FastV code at https://github.com/pkunlp-icler/FastV/blob/36e71e90c6c8cd5f5de97eebfc2727a83b261327/src/transformers/src/transformers/models/llama/modeling_llama.py

The code only masks the trivial image tokens in the attention mask, which won't reduce any attention computation. Is there any code change I have missed?
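For context, a minimal sketch (not the FastV code, names are illustrative) of why attention-mask-based masking alone does not save FLOPs: the masked positions still participate in the full-length matmuls and only get pushed to -inf before the softmax.

```python
# Illustrative sketch: masking image tokens in the attention mask keeps the
# full seq_len x seq_len matmuls, so the FLOPs do not go down.
import torch

def masked_attention(q, k, v, attention_mask):
    # q, k, v: (batch, heads, seq_len, head_dim); attention_mask: additive, with -inf at masked positions
    scores = torch.matmul(q, k.transpose(-1, -2)) / (q.size(-1) ** 0.5)  # full-length matmul
    scores = scores + attention_mask   # masked image tokens get -inf, but their scores were still computed
    probs = torch.softmax(scores, dim=-1)
    return torch.matmul(probs, v)      # again a full-length matmul
```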

Sorry, I put a wrong code link in the README; please refer to the latest code in the repo: https://github.com/pkunlp-icler/FastV/blob/main/src/transformers/src/transformers/models/llama/modeling_llama.py#L730 . The FASTV_INPLACE method actually drops the hidden states of the pruned image tokens instead of masking them, so the sequence passed to the later layers becomes shorter and the attention/FFN computation is genuinely reduced. You can also check the latency reproduction experiment at https://github.com/pkunlp-icler/FastV?tab=readme-ov-file#latency-experiment-reproduction
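A minimal sketch of that inplace-dropping idea (illustrative only, not the repo's exact implementation; the function name, scoring rule, and arguments are assumptions): after a chosen layer, rank the image tokens by the attention they receive, keep only the top fraction, and physically remove the rest from the hidden states so every subsequent layer runs on a shorter sequence.

```python
# Illustrative sketch of inplace image-token pruning (not the repo's exact code).
import torch

def prune_image_tokens(hidden_states, attn_weights, img_start, img_end, keep_ratio=0.5):
    # hidden_states: (batch, seq_len, dim); attn_weights: (batch, heads, seq_len, seq_len)
    # Score each image token by the mean attention it receives across heads and queries.
    img_scores = attn_weights.mean(dim=(1, 2))[:, img_start:img_end]   # (batch, num_img_tokens)
    num_keep = max(1, int((img_end - img_start) * keep_ratio))
    top_idx = img_scores.topk(num_keep, dim=-1).indices + img_start    # absolute positions of kept tokens

    batch, seq_len, dim = hidden_states.shape
    keep_mask = torch.ones(batch, seq_len, dtype=torch.bool, device=hidden_states.device)
    keep_mask[:, img_start:img_end] = False
    keep_mask.scatter_(1, top_idx, True)                               # re-enable the kept image tokens

    # Physically drop the pruned tokens: later layers see a shorter sequence,
    # which is what actually reduces the FLOPs (unlike attention masking).
    return hidden_states[keep_mask].view(batch, -1, dim)
```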