pkunlp-icler/FastV

I still don't know why this method can reduce FLOPs after reading the code

Closed this issue · 1 comment

I've read the FastV code at https://github.com/pkunlp-icler/FastV/blob/36e71e90c6c8cd5f5de97eebfc2727a83b261327/src/transformers/src/transformers/models/llama/modeling_llama.py

The code only masks the trivial image tokens in the attention mask, which won't reduce any attention computation. Is there any code change I have missed?
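For context, a minimal sketch (not the FastV code, names are illustrative) of why attention-mask-based masking alone does not save FLOPs: the masked positions still participate in the full-length matmuls and only get pushed to -inf before the softmax.

```python
# Illustrative sketch: masking image tokens in the attention mask keeps the
# full seq_len x seq_len matmuls, so the FLOPs do not go down.
import torch

def masked_attention(q, k, v, attention_mask):
    # q, k, v: (batch, heads, seq_len, head_dim); attention_mask: additive, with -inf at masked positions
    scores = torch.matmul(q, k.transpose(-1, -2)) / (q.size(-1) ** 0.5)  # full-length matmul
    scores = scores + attention_mask   # masked image tokens get -inf, but their scores were still computed
    probs = torch.softmax(scores, dim=-1)
    return torch.matmul(probs, v)      # again a full-length matmul
```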

Sorry, I put a wrong code link in the README; please refer to the latest code in the repo: https://github.com/pkunlp-icler/FastV/blob/main/src/transformers/src/transformers/models/llama/modeling_llama.py#L730 . The FASTV_INPLACE method actually drops the hidden states of the pruned image tokens instead of masking them, so the sequence passed to the later layers becomes shorter and the attention/FFN computation is genuinely reduced. You can also check the latency reproduction experiment at https://github.com/pkunlp-icler/FastV?tab=readme-ov-file#latency-experiment-reproduction
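A minimal sketch of that inplace-dropping idea (illustrative only, not the repo's exact implementation; the function name, scoring rule, and arguments are assumptions): after a chosen layer, rank the image tokens by the attention they receive, keep only the top fraction, and physically remove the rest from the hidden states so every subsequent layer runs on a shorter sequence.

```python
# Illustrative sketch of inplace image-token pruning (not the repo's exact code).
import torch

def prune_image_tokens(hidden_states, attn_weights, img_start, img_end, keep_ratio=0.5):
    # hidden_states: (batch, seq_len, dim); attn_weights: (batch, heads, seq_len, seq_len)
    # Score each image token by the mean attention it receives across heads and queries.
    img_scores = attn_weights.mean(dim=(1, 2))[:, img_start:img_end]   # (batch, num_img_tokens)
    num_keep = max(1, int((img_end - img_start) * keep_ratio))
    top_idx = img_scores.topk(num_keep, dim=-1).indices + img_start    # absolute positions of kept tokens

    batch, seq_len, dim = hidden_states.shape
    keep_mask = torch.ones(batch, seq_len, dtype=torch.bool, device=hidden_states.device)
    keep_mask[:, img_start:img_end] = False
    keep_mask.scatter_(1, top_idx, True)                               # re-enable the kept image tokens

    # Physically drop the pruned tokens: later layers see a shorter sequence,
    # which is what actually reduces the FLOPs (unlike attention masking).
    return hidden_states[keep_mask].view(batch, -1, dim)
```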