Why are there prob_if_in/prob_if_out in the MoE load loss?
Can you explain why there are `prob_if_in` and `prob_if_out` in `MoE._prob_in_top_k`?
I am a little confused because the original paper doesn't mention that L_load needs two kinds of probabilities.
Hi! Please see the description of the function:
Computes the probability that value is in top k, given different random noise. This gives us a way of backpropagating from a loss that balances the number of times each expert is in the top k experts per example. In the case of no noise, pass in None for noise_stddev, and the result will not be differentiable.
What's done here is a trick to keep the gating differentiable by adding some noise. `prob_if_in` and `prob_if_out` are the probabilities that a gate value would still land in the top k under freshly sampled noise, computed against two different thresholds depending on whether the noisy value is currently in or out of the top k. Hope this helps.
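For concreteness, here is a minimal sketch of that computation in PyTorch. It illustrates the idea rather than copying the repository code; the function name, argument names, and shapes are assumptions:

```python
import torch
from torch.distributions.normal import Normal

def prob_in_top_k(clean_logits, noisy_logits, noise_stddev, noisy_top_logits, k):
    """Probability that each clean logit would be in the top k under fresh noise.

    Sketch only -- assumed shapes:
      clean_logits, noisy_logits, noise_stddev: [batch, num_experts]
      noisy_top_logits: [batch, k + 1]  (top k+1 noisy logits, sorted descending)
    """
    normal = Normal(torch.tensor(0.0), torch.tensor(1.0))

    # If an expert is currently IN the top k, its own value occupies a slot,
    # so under resampled noise it only has to beat the (k+1)-th largest value.
    # If it is currently OUT, it has to beat the k-th largest value.
    threshold_if_in = noisy_top_logits[:, k].unsqueeze(1)       # (k+1)-th largest
    threshold_if_out = noisy_top_logits[:, k - 1].unsqueeze(1)  # k-th largest

    # Experts strictly above the (k+1)-th largest value are currently in the top k.
    is_in = noisy_logits > threshold_if_in

    # P(clean + noise > threshold) with noise ~ N(0, noise_stddev)
    # equals Phi((clean - threshold) / noise_stddev).
    prob_if_in = normal.cdf((clean_logits - threshold_if_in) / noise_stddev)
    prob_if_out = normal.cdf((clean_logits - threshold_if_out) / noise_stddev)

    # Pick the probability that matches each expert's current in/out status.
    return torch.where(is_in, prob_if_in, prob_if_out)
```

Summing this probability over the batch gives a smooth, differentiable estimate of each expert's load, which is what the load-balancing loss penalizes.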
Thank you so much!
But when I run the code on the validation set, `train=False` in `MoE.noisy_top_k_gating`, so after
noise_stddev = ((self.softplus(raw_noise_stddev) + noise_epsilon) * train)
`noise_stddev` is zero and gets passed to `MoE._prob_in_top_k`, where
prob_if_in = self.normal.cdf((clean_values - threshold_if_in) / noise_stddev)
divides by zero and raises an error.
Did I do something wrong, or is this a bug?
Thanks! Good catch, just fixed the evaluation and made it easier to run (+added GPU support). Please let me know if it works for you now!
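For reference, a sketch of the kind of guard that avoids the division by zero at evaluation time. It reuses the `prob_in_top_k` helper sketched above; the other names are illustrative, not necessarily the repository's exact API:

```python
import torch

def compute_load(clean_logits, noisy_logits, noise_stddev, noisy_top_logits,
                 gates, k, num_experts, train, noisy_gating=True):
    """Per-expert load estimate that is safe when noise_stddev is zero."""
    if noisy_gating and k < num_experts and train:
        # Training path: differentiable load via the noise trick.
        return prob_in_top_k(
            clean_logits, noisy_logits, noise_stddev, noisy_top_logits, k
        ).sum(0)
    # Evaluation path: no noise, so just count how many examples route to
    # each expert. Not differentiable, but no division by zero either.
    return (gates > 0).sum(0)
```

The key design point is that the CDF-based estimate only makes sense when noise is actually being added, so the noise-free path falls back to a simple count of activated gates per expert.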
It works pretty well now~