facebookresearch/mega

Regarding the damping factor δ

Closed · 1 comment

ciaua commented

Hello,

I am not sure whether this is a bug, but I have a question about the damping factor δ after reading the paper and the code.

The paper says that "MEGA allows the damping of the influence of the previous time step". I take this to mean that δ damps the influence of the previous time step.

Formula (3) in the paper is: $y_t = \alpha \odot x_t + (1 - \alpha \odot \delta) \odot y_{t-1}$
But if δ is for damping the previous time steps, shouldn't it be $y_t = \alpha \odot x_t + ((1 - \alpha) \odot \delta) \odot y_{t-1}$?

With formula (3), δ is between 0 and 1, so $(1 - \alpha \odot \delta) \ge (1 - \alpha)$, which enhances the influence of the previous time step instead of damping it. I thought it was a misplaced parenthesis in the paper. However, I checked the code in this repo and found that formula (3) is also used there (https://github.com/facebookresearch/mega/blob/main/fairseq/modules/exponential_moving_average.py#L79), so formula (3) seems to be the intended formula.
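The inequality can be checked numerically. This is a minimal sketch, assuming scalar α and δ in place of the element-wise vectors; the function names are illustrative and do not come from the repository:

```python
def decay_paper(alpha: float, delta: float) -> float:
    """Coefficient on y_{t-1} in formula (3): 1 - alpha * delta."""
    return 1.0 - alpha * delta


def decay_proposed(alpha: float, delta: float) -> float:
    """Coefficient on y_{t-1} in the alternative: (1 - alpha) * delta."""
    return (1.0 - alpha) * delta


alpha = 0.5
for delta in (0.25, 0.5, 1.0):
    # Print delta, then both candidate coefficients on y_{t-1}.
    print(delta, decay_paper(alpha, delta), decay_proposed(alpha, delta))
```

With α = 0.5 and δ = 0.25, formula (3) gives a coefficient of 0.875 while the alternative gives 0.125. For δ < 1 the formula (3) coefficient stays above 1 − α and the alternative falls below it; the two coincide only at δ = 1.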

Could you clarify whether formula (3) is the correct formula? If so, why does it have a damping effect on the previous time steps?
Thanks!

Thanks for pointing this out! The equation is correct, but our description in the paper is problematic. In fact, we want to use $\delta$ to dampen the influence of $\alpha$ on the previous step.
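
One way to see the effect of δ in formula (3) is to feed a single impulse through a scalar version of the recurrence. The sketch below is illustrative only: `damped_ema` is a hypothetical helper, scalars stand in for the element-wise vectors, and it is not the repository's implementation.

```python
def damped_ema(xs, alpha, delta, y0=0.0):
    """Scalar form of formula (3): y_t = alpha * x_t + (1 - alpha * delta) * y_{t-1}."""
    y = y0
    outputs = []
    for x in xs:
        y = alpha * x + (1.0 - alpha * delta) * y
        outputs.append(y)
    return outputs


# A single impulse at t = 0, followed by zeros.
impulse = [1.0, 0.0, 0.0, 0.0]

standard = damped_ema(impulse, alpha=0.5, delta=1.0)  # plain EMA, decay = 1 - alpha
damped = damped_ema(impulse, alpha=0.5, delta=0.5)    # decay = 1 - alpha * delta

print(standard)
print(damped)
```

With δ = 1 the impulse halves at every step (0.5, 0.25, 0.125, 0.0625). With δ = 0.5 it fades more slowly (0.5, 0.375, 0.28125, 0.2109375), so a smaller δ weakens the α-driven decay and keeps more of the previous state.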