csrhddlam/axial-deeplab

position-sensitive attention

fmu2 opened this issue · 1 comment

fmu2 commented

Thanks for the great work!

I am a bit confused about this piece of code:

kr = torch.einsum('bgci,cij->bgij', k, k_embedding).transpose(2, 3)

According to Eq. 4 in the paper, I have the impression that it should be torch.einsum('bgcj,cij->bgij', k, k_embedding) since p is the varying index. Please correct me if I am wrong. Thanks!

This depends on which axis of the embedding you choose as the varying one: the two axes of the embedding correspond to two different directions, but both are relative.
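
For concreteness, here is a minimal sketch of the point above (the shapes `b, g, c, L` are my own assumptions for illustration, not taken from the repo): the two conventions only differ in which spatial axis of the learned relative embedding plays the query role, so swapping those two axes maps one onto the other.

```python
import torch

# Assumed shapes for illustration: batch b, attention groups g,
# channels-per-group c, axial length L.
b, g, c, L = 2, 4, 8, 16
k = torch.randn(b, g, c, L)         # keys along one spatial axis
k_embedding = torch.randn(c, L, L)  # relative position embedding

# Repo version: before the transpose, result[b, g, i, j] = sum_c k[b, g, c, i] * k_embedding[c, i, j];
# the transpose then swaps the last two axes.
kr_repo = torch.einsum('bgci,cij->bgij', k, k_embedding).transpose(2, 3)

# Variant from the question: kr[b, g, i, j] = sum_c k[b, g, c, j] * k_embedding[c, i, j].
kr_question = torch.einsum('bgcj,cij->bgij', k, k_embedding)

# Swapping the two spatial axes of the embedding makes the question's convention
# reproduce the repo's result, so for a learned embedding the choice is a convention.
kr_equiv = torch.einsum('bgcj,cij->bgij', k, k_embedding.transpose(1, 2))
print(torch.allclose(kr_repo, kr_equiv, atol=1e-6))  # True
```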