Sinusoidal embedding order choice different from original definition
gordicaleksa opened this issue · 1 comment
gordicaleksa commented
Hey Phil! One more from me! :)
I see that the way you stack sinusoidal embeddings here is different from the original transformer paper (section 3.5).
Instead of:

```python
emb = torch.cat((emb.sin(), emb.cos()), dim = -1)
```

what the original one does is this:

```python
emb = torch.stack((emb.sin(), emb.cos()), dim=-1).view(max_pos, -1)
```
i.e. your vector looks like:

`[sin(x1), sin(x2), ..., cos(x1), cos(x2), ...]`

whereas in the original paper it was like:

`[sin(x1), cos(x1), sin(x2), cos(x2), ...]`
again, the network will certainly learn from both of these - I was just curious whether there's any empirical finding showing that the first definition is more performant?
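For concreteness, here's a minimal, self-contained sketch of the two layouts (the `dim`, `max_pos`, and frequency schedule below are illustrative values I picked, not pulled from the repo). It shows the two orderings differ only by a fixed permutation of the feature axis:

```python
import math
import torch

# illustrative values, not the repo's defaults
dim = 8            # embedding dimension (assumed even)
max_pos = 4        # number of positions to embed
half_dim = dim // 2

# geometric frequency schedule in the spirit of section 3.5 of "Attention Is All You Need"
freqs = torch.exp(torch.arange(half_dim) * -(math.log(10000.0) / (half_dim - 1)))
angles = torch.arange(max_pos).float()[:, None] * freqs[None, :]  # (max_pos, half_dim)

# this repo's layout: [sin(x1), sin(x2), ..., cos(x1), cos(x2), ...]
emb_cat = torch.cat((angles.sin(), angles.cos()), dim=-1)

# original paper's layout: [sin(x1), cos(x1), sin(x2), cos(x2), ...]
emb_interleaved = torch.stack((angles.sin(), angles.cos()), dim=-1).view(max_pos, -1)

# the two are related by a fixed column permutation, so any linear layer
# that follows the embedding can absorb the difference
perm = torch.arange(dim).view(2, half_dim).t().reshape(-1)
assert torch.equal(emb_cat[:, perm], emb_interleaved)
```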
lucidrains commented
hmm not that i know of, but if you run any benchmarks, let me know