hkproj/pytorch-transformer

Position Encoding

Closed this issue · 2 comments

```python
div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)) # (d_model / 2)
# Apply sine to even indices
pe[:, 0::2] = torch.sin(position * div_term) # sin(position * (10000 ** (2i / d_model))
```

$$\sin\left(\frac{pos}{10000^{\frac{2i}{d_{model}}}}\right)$$
In the last line the comment says it's: sin(position * (10000 ** (2i / d_model))), shouldn't it be sin(position / (10000 ** (2i / d_model)))?

hkproj commented

Hi Sheiphan,

I took the positional encoding code from The Annotated Transformer article at Harvard. Anyway, as long as the frequency of the sines and cosines represents a pattern, the model should be able to learn it. We don't have to use the exact pattern written in the paper; as a matter of fact, there are many techniques for encoding positions, including relative positions instead of absolute ones.

You can replace the formula and calculate it in the "vanilla way", just like in the original paper, without using log or exp, and it will work just as well. The reason we prefer doing it in log space is numerical stability (we don't like dealing with very small or very big numbers). This article explains why we prefer log operations in CS.
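For concreteness, here is a minimal sketch (not from this repository; d_model and max_len are just illustrative values) showing that the log-space div_term and the paper's direct formula give the same frequencies:

```python
import math
import torch

# Illustrative sizes, not taken from the repository.
d_model, max_len = 512, 100

position = torch.arange(0, max_len).unsqueeze(1).float()  # (max_len, 1)
two_i = torch.arange(0, d_model, 2).float()                # even indices 0, 2, 4, ... i.e. "2i" in the paper

# Log-space version, as in the snippet above: exp(2i * (-ln(10000) / d_model))
div_term_log = torch.exp(two_i * (-math.log(10000.0) / d_model))

# "Vanilla" version, written exactly like the paper: 1 / 10000^(2i / d_model)
div_term_vanilla = 1.0 / (10000.0 ** (two_i / d_model))

print(torch.allclose(div_term_log, div_term_vanilla))      # True, up to float rounding

# Either div_term gives the same sines for the even dimensions of the encoding.
pe_even = torch.sin(position * div_term_log)               # (max_len, d_model / 2)
```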

If you want to learn about relative positional encodings, I suggest the following paper: https://arxiv.org/pdf/1803.02155.pdf

Have a nice day!

@Sheiphan

I also got confused by this, but then I realized there is a negative sign inside the exp, before math.log: (-math.log(10000.0) / d_model). A negative sign in an exponent means taking the inverse; for example, $10^{-1}$ is $\frac{1}{10}$. So since div_term is itself already the inverse term, we can directly do position * div_term.
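Writing out the algebra behind that observation:

$$\text{div\_term} = \exp\left(2i \cdot \frac{-\ln 10000}{d_{model}}\right) = 10000^{-\frac{2i}{d_{model}}} = \frac{1}{10000^{\frac{2i}{d_{model}}}}$$

so

$$\text{position} \cdot \text{div\_term} = \frac{pos}{10000^{\frac{2i}{d_{model}}}},$$

which is exactly the expression in the paper.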