mit-han-lab/lite-transformer

In paragraph 4 of the paper

sanwei111 opened this issue · 1 comment

It can be easily distinguished that instead of attempting to model both global and local contexts, the attention module in LSRA only focuses on the global contexts capture (no diagonal pattern), leaving the local contexts capture to the convolution branch

I just wonder: if the attention branch is in charge of the global features using the original attention module, why does the paper say "(no diagonal pattern)"?

Hi @sanwei111, thank you for asking. The diagonal patterns correspond to the local contexts, which can be captured by the conv branch. LSRA implicitly learns to specialize the two branches: the attention branch captures the global context, while the diagonal (local) patterns are left to the conv branch.
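To illustrate the two-branch specialization described above, here is a minimal, hypothetical sketch (not the repository's actual implementation, which uses a lightweight/dynamic convolution): the input channels are split in half, one half goes through plain self-attention for global context, and the other half through a depthwise 1-D convolution for the local, diagonal-like context. The class name `LSRASketch` and all hyperparameters are illustrative assumptions.

```python
# Hypothetical sketch of the LSRA idea, assuming a simple channel split,
# vanilla multi-head attention for the global branch, and a depthwise
# convolution for the local branch. Not the repo's actual code.
import torch
import torch.nn as nn


class LSRASketch(nn.Module):
    def __init__(self, embed_dim: int, num_heads: int = 4, kernel_size: int = 3):
        super().__init__()
        assert embed_dim % 2 == 0, "embed_dim must be even to split it in half"
        half = embed_dim // 2
        # Global branch: standard self-attention (no need to model the diagonal).
        self.attn = nn.MultiheadAttention(half, num_heads, batch_first=True)
        # Local branch: depthwise 1-D convolution over the sequence dimension.
        self.conv = nn.Conv1d(half, half, kernel_size,
                              padding=kernel_size // 2, groups=half)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, embed_dim)
        x_global, x_local = x.chunk(2, dim=-1)
        # Attention branch focuses on long-range (global) dependencies.
        attn_out, _ = self.attn(x_global, x_global, x_global)
        # Conv branch covers the neighbourhood, i.e. the "diagonal pattern".
        conv_out = self.conv(x_local.transpose(1, 2)).transpose(1, 2)
        return torch.cat([attn_out, conv_out], dim=-1)


# Usage: output keeps the input shape.
x = torch.randn(2, 10, 64)
print(LSRASketch(64)(x).shape)  # torch.Size([2, 10, 64])
```

Because the conv branch already handles nearby tokens, the attention branch no longer needs to produce the diagonal (local) attention pattern, which is why the paper describes it as capturing only the global context.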