Does the attention mask reduce computation cost?
brotherb opened this issue · 1 comment
brotherb commented
Hey there.
After reading the code, I am confused about how the computation cost can be reduced by masking more tokens. Did I miss anything?
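For concreteness, here is a minimal sketch (my own toy code, not taken from this repo) of what I mean: with a boolean attention mask the score and output tensors keep their full size, so the matmul cost is unchanged, whereas physically gathering only the kept tokens actually shrinks the matrices.

```python
import torch

B, N, D = 1, 196, 64            # batch, tokens, head dim (illustrative numbers)
q = k = v = torch.randn(B, N, D)
keep = torch.rand(B, N) > 0.5   # hypothetical per-token keep decision

# Masked attention: scores are still (B, N, N), so the FLOPs match the unpruned case.
scores = q @ k.transpose(-2, -1) / D ** 0.5
scores = scores.masked_fill(~keep[:, None, :], float("-inf"))
out_masked = scores.softmax(-1) @ v
print(out_masked.shape)         # torch.Size([1, 196, 64])

# Actually dropping tokens: the matrices shrink, so the computation goes down.
q_kept = q[keep].unsqueeze(0)   # batch size 1 in this sketch
k_kept = k[keep].unsqueeze(0)
v_kept = v[keep].unsqueeze(0)
scores_kept = q_kept @ k_kept.transpose(-2, -1) / D ** 0.5
out_pruned = scores_kept.softmax(-1) @ v_kept
print(out_pruned.shape)         # (1, n_kept, 64), with n_kept < 196
```

So my question is whether the masking path gives a real speedup, or whether the savings only appear when tokens are actually dropped.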
P.S. I see that the FLOPs are calculated from the number of tokens retained at each layer, which is counted during inference.
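If I understand the accounting correctly, it amounts to something like the sketch below: plug the per-layer retained-token count into the standard transformer-block formula. The function name, the exact formula, and the token counts are my assumptions for illustration, not the repo's code.

```python
def block_flops(num_tokens: int, dim: int, mlp_ratio: float = 4.0) -> float:
    """Rough per-block cost for `num_tokens` retained tokens (counted as multiply-adds)."""
    n, d = num_tokens, dim
    qkv_and_proj = 4 * n * d * d              # QKV projections + output projection
    attn_matmuls = 2 * n * n * d              # QK^T and attention @ V
    mlp = 2 * mlp_ratio * n * d * d           # two MLP linear layers
    return qkv_and_proj + attn_matmuls + mlp

# Hypothetical example: 12 layers, tokens pruned over depth from 196 down.
tokens_per_layer = [196] * 4 + [138] * 4 + [97] * 4
total = sum(block_flops(n, dim=384) for n in tokens_per_layer)
print(f"{total / 1e9:.2f} G multiply-adds")
```

That counts theoretical cost only, which is why I am also asking about measured latency.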
Do you have inference latency metrics?