NVIDIA/Megatron-LM

[QUESTION] Is it expected that grad norm is computed separately for the dense optimizer and the MoE optimizer?


If we enable expert parallelism, there are two optimizers: one for the dense parameters and one for the expert parameters. When we call optimizer.step(), each optimizer performs grad norm clipping over its own parameters only.

But if we do not enable expert parallelism, the gradients of all model parameters are normed together as a single group.
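To illustrate the difference, here is a minimal PyTorch sketch (not Megatron-LM code; the gradient values and the clip_ helper are made up for illustration) comparing per-optimizer clipping with global clipping:

```python
import torch

# Hypothetical gradients for dense and expert parameters
# (illustrative values, not Megatron-LM internals).
dense_grads = [torch.tensor([3.0, 4.0])]    # L2 norm = 5
expert_grads = [torch.tensor([5.0, 12.0])]  # L2 norm = 13
max_norm = 1.0

def clip_(grads, max_norm):
    """Scale grads in place so their combined L2 norm is <= max_norm."""
    total_norm = torch.norm(torch.stack([torch.norm(g) for g in grads]))
    scale = max_norm / (total_norm + 1e-6)
    if scale < 1:
        for g in grads:
            g.mul_(scale)
    return total_norm

# Case 1: separate clipping, one norm per optimizer (as described above
# for expert parallelism enabled).
d = [g.clone() for g in dense_grads]
e = [g.clone() for g in expert_grads]
clip_(d, max_norm)  # dense grads scaled by 1/5
clip_(e, max_norm)  # expert grads scaled by 1/13

# Case 2: global clipping over all parameters (expert parallelism disabled).
all_grads = [g.clone() for g in dense_grads + expert_grads]
clip_(all_grads, max_norm)  # everything scaled by 1/sqrt(5**2 + 13**2)

print("separate:", d, e)
print("global:  ", all_grads)
```

In the separate case the dense and expert gradients are rescaled by different factors (1/5 vs. 1/13), whereas global clipping applies a single factor (1/sqrt(194)) to all of them, so the relative magnitudes of the two groups come out different.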

So my question is: the grad norm behavior differs mathematically depending on whether expert parallelism is turned on. Is this expected?