lucidrains/x-transformers

Sinusoidal embedding order choice different from original definition

gordicaleksa opened this issue · 1 comment

Hey Phil! One more from me! :)

I see that the way you stack sinusoidal embeddings here is different from the original transformer paper (section 3.5).

Instead of:
emb = torch.cat((emb.sin(), emb.cos()), dim = -1)

What the original one does is this:
emb = torch.stack((emb.sin(), emb.cos()), dim=-1).view(max_pos, -1)

i.e. your vector looks like:
[sin(x1),sin(x2),...,cos(x1),cos(x2),...]
whereas in the original paper it was like:
[sin(x1),cos(x1),sin(x2),cos(x2),...]
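
For concreteness, here is a minimal sketch (not code from the repo; `max_pos`, `dim`, and the exact frequency schedule are just illustrative choices) showing that the two layouts contain the same values and differ only by a fixed permutation of the feature dimension:

```python
import math
import torch

# Minimal sketch, not the repo's code: toy sizes and the standard
# frequency schedule from "Attention Is All You Need", section 3.5.
max_pos, dim = 8, 16
half_dim = dim // 2
freqs = torch.exp(torch.arange(half_dim) * -(math.log(10000.0) / half_dim))
emb = torch.arange(max_pos).unsqueeze(1) * freqs.unsqueeze(0)  # (max_pos, half_dim)

# x-transformers layout: [sin(x1), sin(x2), ..., cos(x1), cos(x2), ...]
concat_version = torch.cat((emb.sin(), emb.cos()), dim=-1)

# paper layout: [sin(x1), cos(x1), sin(x2), cos(x2), ...]
interleaved_version = torch.stack((emb.sin(), emb.cos()), dim=-1).view(max_pos, -1)

# the interleaved layout is just a fixed permutation of the concatenated one
perm = torch.arange(dim).view(2, half_dim).t().reshape(-1)
assert torch.allclose(interleaved_version, concat_version[:, perm])
```

Since the difference is only a re-indexing of the same features, any downstream projection can presumably absorb it, which would explain why both conventions work in practice.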

Again, the network will certainly learn from both of these. I was just curious whether there has been any empirical finding showing that the first definition is more performant?

hmm not that i know of, but if you run any benchmarks, let me know