davidmrau/mixture-of-experts

Why are there prob_if_in/out in the MoE load loss?

Closed this issue · 4 comments

Can you explain why there are "prob_if_in" and "prob_if_out" in "MoE._prob_in_top_k"?
I am a little confused because the original paper doesn't mention that L_load needs two kinds of probabilities.

Hi! Please see the description of the function:

Computes the probability that value is in top k, given different random noise. This gives us a way of backpropagating from a loss that balances the number of times each expert is in the top k experts per example. In the case of no noise, pass in None for noise_stddev, and the result will not be differentiable.

What's done here is a trick that makes the gating differentiable by adding some noise. prob_if_in and prob_if_out are used to check whether the gate with added random noise would still be activated, i.e., whether that expert would still land in the top k. Hope this helps.
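Roughly, the function follows the load-balancing estimate from Shazeer et al. (2017): for each expert it asks how likely it is, over a fresh draw of the Gaussian noise, that the expert would (still) make it into the top k. The threshold to beat depends on whether the expert is currently in the top k (then it must stay above the (k+1)-th largest noisy value) or currently out (then it must overtake the k-th largest). Below is a minimal sketch of that idea; names and shapes are assumed for illustration rather than copied verbatim from the repo:

```python
import torch
from torch.distributions.normal import Normal


def prob_in_top_k_sketch(clean_values, noisy_values, noise_stddev, noisy_top_values, k):
    """Probability that each expert stays in the top k under fresh Gaussian noise.

    clean_values:     [batch, n_experts]  gate logits without noise
    noisy_values:     [batch, n_experts]  gate logits after noise was added
    noise_stddev:     [batch, n_experts]  per-gate noise standard deviation
    noisy_top_values: [batch, k + 1]      the k+1 largest noisy logits per example
    """
    batch = clean_values.size(0)
    m = noisy_top_values.size(1)  # m == k + 1
    top_values_flat = noisy_top_values.flatten()

    # If expert i is currently IN the top k, it stays in as long as it beats the
    # (k+1)-th largest noisy value (the k-th largest once its own value is excluded).
    positions_if_in = torch.arange(batch, device=clean_values.device) * m + k
    threshold_if_in = torch.gather(top_values_flat, 0, positions_if_in).unsqueeze(1)
    is_in = torch.gt(noisy_values, threshold_if_in)

    # If expert i is currently OUT, it would instead have to beat the k-th largest value.
    threshold_if_out = torch.gather(top_values_flat, 0, positions_if_in - 1).unsqueeze(1)

    # Probability, over the Gaussian noise, that the clean logit plus fresh noise
    # clears the relevant threshold: Phi((clean - threshold) / sigma).
    normal = Normal(torch.tensor(0.0), torch.tensor(1.0))
    prob_if_in = normal.cdf((clean_values - threshold_if_in) / noise_stddev)
    prob_if_out = normal.cdf((clean_values - threshold_if_out) / noise_stddev)
    return torch.where(is_in, prob_if_in, prob_if_out)
```

Summing this probability over the batch gives a smooth, differentiable estimate of each expert's load, which is what the two variables feed into.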

Thank you so much!
But when I run the code on the validation set, "train=False" is passed to MoE.noisy_top_k_gating, so after

noise_stddev = ((self.softplus(raw_noise_stddev) + noise_epsilon) * train)

noise_stddev will be zero when it is sent to MoE._prob_in_top_k, and in

prob_if_in = self.normal.cdf((clean_values - threshold_if_in) / noise_stddev)

the division by noise_stddev becomes a division by zero and raises an error.

Did I do something wrong, or is there a mistake in the code?

Thanks! Good catch, I just fixed the evaluation and made it easier to run (and added GPU support). Please let me know if it works for you now!
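For anyone landing here later: the fix boils down to only using the CDF-based load estimate when noise is actually applied, and falling back to simply counting expert assignments at evaluation time. A hedged sketch of that guard, with names as in the repo's noisy_top_k_gating but not necessarily the exact committed code:

```python
# Inside noisy_top_k_gating, after computing the gates:
if self.noisy_gating and self.k < self.num_experts and train:
    # Training with noise: differentiable load via _prob_in_top_k (the CDF trick above).
    load = self._prob_in_top_k(clean_logits, noisy_logits, noise_stddev, top_logits).sum(0)
else:
    # Evaluation (noise_stddev is zero): just count how many examples hit each expert.
    load = self._gates_to_load(gates)
```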

It works pretty well now~