Why are there prob_if_in/prob_if_out in the MoE load loss?
Can you explain why there are `prob_if_in` and `prob_if_out` in `MoE._prob_in_top_k`?
I am a little confused because the original paper doesn't mention that L_load needs two kinds of probabilities.
Hi! Please see the description of the function:
Computes the probability that value is in top k, given different random noise. This gives us a way of backpropagating from a loss that balances the number of times each expert is in the top k experts per example. In the case of no noise, pass in None for noise_stddev, and the result will not be differentiable.
What's done here is a trick to keep the gating differentiable by adding some noise. `prob_if_in` and `prob_if_out` are the probabilities that a gate value would still land in the top k under freshly sampled noise, computed against two different thresholds depending on whether the noisy value is currently in or out of the top k. Hope this helps.
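For concreteness, here is a minimal sketch of that computation in PyTorch. It illustrates the idea rather than copying the repository code; the function name, argument names, and shapes are assumptions:

```python
import torch
from torch.distributions.normal import Normal

def prob_in_top_k(clean_logits, noisy_logits, noise_stddev, noisy_top_logits, k):
    """Probability that each clean logit would be in the top k under fresh noise.

    Sketch only -- assumed shapes:
      clean_logits, noisy_logits, noise_stddev: [batch, num_experts]
      noisy_top_logits: [batch, k + 1]  (top k+1 noisy logits, sorted descending)
    """
    normal = Normal(torch.tensor(0.0), torch.tensor(1.0))

    # If an expert is currently IN the top k, its own value occupies a slot,
    # so under resampled noise it only has to beat the (k+1)-th largest value.
    # If it is currently OUT, it has to beat the k-th largest value.
    threshold_if_in = noisy_top_logits[:, k].unsqueeze(1)       # (k+1)-th largest
    threshold_if_out = noisy_top_logits[:, k - 1].unsqueeze(1)  # k-th largest

    # Experts strictly above the (k+1)-th largest value are currently in the top k.
    is_in = noisy_logits > threshold_if_in

    # P(clean + noise > threshold) with noise ~ N(0, noise_stddev)
    # equals Phi((clean - threshold) / noise_stddev).
    prob_if_in = normal.cdf((clean_logits - threshold_if_in) / noise_stddev)
    prob_if_out = normal.cdf((clean_logits - threshold_if_out) / noise_stddev)

    # Pick the probability that matches each expert's current in/out status.
    return torch.where(is_in, prob_if_in, prob_if_out)
```

Summing this probability over the batch gives a smooth, differentiable estimate of each expert's load, which is what the load-balancing loss penalizes.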
Thank you so much!
But when I run the code on the validation set, `train=False` in `MoE.noisy_top_k_gating`, so after
noise_stddev = ((self.softplus(raw_noise_stddev) + noise_epsilon) * train)
`noise_stddev` is zero and gets passed to `MoE._prob_in_top_k`, where
prob_if_in = self.normal.cdf((clean_values - threshold_if_in) / noise_stddev)
divides by zero and raises an error.
Did I do something wrong, or is this a bug?
Thanks! Good catch, just fixed the evaluation and made it easier to run (+added GPU support). Please let me know if it works for you now!
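For reference, a sketch of the kind of guard that avoids the division by zero at evaluation time. It reuses the `prob_in_top_k` helper sketched above; the other names are illustrative, not necessarily the repository's exact API:

```python
import torch

def compute_load(clean_logits, noisy_logits, noise_stddev, noisy_top_logits,
                 gates, k, num_experts, train, noisy_gating=True):
    """Per-expert load estimate that is safe when noise_stddev is zero."""
    if noisy_gating and k < num_experts and train:
        # Training path: differentiable load via the noise trick.
        return prob_in_top_k(
            clean_logits, noisy_logits, noise_stddev, noisy_top_logits, k
        ).sum(0)
    # Evaluation path: no noise, so just count how many examples route to
    # each expert. Not differentiable, but no division by zero either.
    return (gates > 0).sum(0)
```

The key design point is that the CDF-based estimate only makes sense when noise is actually being added, so the noise-free path falls back to a simple count of activated gates per expert.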
It works pretty well now~