Vision-Transformer-from-scratch-using-external-attention-for-classification

paper: https://arxiv.org/pdf/2105.02358.pdf

Algorithm

```python
F = query_linear(F)           # shape = (B, N, C)
attn = M_k(F)                 # shape = (B, N, M)
attn = softmax(attn, dim=1)
attn = l1_norm(attn, dim=2)
out = M_v(attn)               # shape = (B, N, C)
```
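The steps above can be sketched in plain NumPy. This is a minimal illustration, not the repository's implementation: the parameter names (`W_q`, `M_k`, `M_v`) and the memory size `S` (called `M` in the pseudocode) are assumptions, and the linear layers are modeled as bare weight matrices without biases.

```python
import numpy as np

def external_attention(F, W_q, M_k, M_v):
    """Sketch of external attention (names are assumptions):
    F:   (B, N, C) input features
    W_q: (C, C)    query projection
    M_k: (C, S)    external key memory
    M_v: (S, C)    external value memory
    """
    F = F @ W_q                                  # shape = (B, N, C)
    attn = F @ M_k                               # shape = (B, N, S)
    # softmax over dim=1 (the token dimension), as in the pseudocode
    attn = np.exp(attn - attn.max(axis=1, keepdims=True))
    attn = attn / attn.sum(axis=1, keepdims=True)
    # l1 normalization over dim=2 (the memory dimension)
    attn = attn / (np.abs(attn).sum(axis=2, keepdims=True) + 1e-9)
    out = attn @ M_v                             # shape = (B, N, C)
    return out
```

The double normalization (softmax over tokens, then l1 over memory slots) is the paper's replacement for a plain softmax and keeps the attention map insensitive to the scale of the input features.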

External attention computes attention between the input features F ∈ R^(N×d) and an external memory unit M ∈ R^(S×d), via:

A = (α_{i,j}) = Norm(F M^T)

F_out = A M

Here A is the attention map inferred from M, a learnable memory that acts as a dataset-level prior; the input features are then updated by multiplying the similarities in A with M.

The computational complexity of external attention is O(dSN); since d and S are hyper-parameters independent of the input size, external attention is linear in the number of pixels N.
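A rough multiply count makes the linear scaling concrete. The sketch below compares external attention (two N×S×d matrix products, for F M^T and A M) against standard self-attention (two N×N×d products, for Q K^T and A V); the function name and the flop-counting convention are my own, not part of the repository.

```python
def matmul_flops(n, s, d):
    """Approximate multiply counts (hypothetical helper):
    n = number of tokens/pixels, s = memory size, d = feature dim."""
    external = 2 * n * s * d   # F @ M_k^T and A @ M_v
    self_attn = 2 * n * n * d  # Q @ K^T and A @ V
    return external, self_attn
```

For n = 1024 tokens with s = 64 memory slots, external attention needs n / s = 16x fewer multiplies than self-attention, and the gap widens linearly as n grows.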