Oryx-mllm/Oryx

[Question] the 'variable-length attention operator in flash attention'


Hi there! The paper mentions using the "variable-length attention operator provided in flash attention (Dao et al., 2022) to compute the attention for each visual input within the batch independently". However, when I read the code here, I could not find anything related to this variable-length attention operator, and it looks like the high-resolution features are encoded with a for loop instead. Did I miss something?
Thank you!

Hi, thanks for your interest in our work! In the code you mentioned, we pre-process the input images into a list, and then forward the whole list to OryxViT for batched computation here. The variable-length attention is applied here. Feel free to ask should you have further questions!
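
For readers landing on this issue later, here is a minimal sketch of how variable-length attention batches differently sized inputs via `flash_attn_varlen_func` from the flash-attn library. The sequence lengths, head counts, and tensor names are illustrative assumptions, not the exact Oryx code: the point is that sequences are packed into one tensor and `cu_seqlens` delimits them, so one kernel call attends over each input independently.

```python
import torch
from flash_attn import flash_attn_varlen_func

# Illustrative: three visual inputs with different token counts,
# packed into a single tensor instead of padded to a common length.
seq_lens = [196, 576, 1024]  # tokens per image (hypothetical values)
nheads, headdim = 16, 64

total = sum(seq_lens)
q = torch.randn(total, nheads, headdim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# cu_seqlens marks where each sequence starts/ends in the packed tensor:
# here it is [0, 196, 772, 1796].
cu_seqlens = torch.zeros(len(seq_lens) + 1, device="cuda", dtype=torch.int32)
cu_seqlens[1:] = torch.tensor(seq_lens, device="cuda").cumsum(0)

# One kernel call computes attention per sequence independently, so
# tokens from different images never attend to each other.
out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens,
    cu_seqlens_k=cu_seqlens,
    max_seqlen_q=max(seq_lens),
    max_seqlen_k=max(seq_lens),
)
# out has shape (total, nheads, headdim), split back per image via seq_lens.
```

Compared with a Python-level for loop over images, this avoids both padding waste and per-image kernel launches while still keeping the attention of each visual input separate.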

Ah thanks! I read the code again and figured it out.