davidmrau/mixture-of-experts
PyTorch Re-Implementation of "The Sparsely-Gated Mixture-of-Experts Layer" by Noam Shazeer et al. https://arxiv.org/abs/1701.06538
Python · GPL-3.0
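For orientation, here is a minimal usage sketch of a sparsely-gated MoE layer in PyTorch. The module path `moe`, the class name `MoE`, its constructor arguments, and the `(output, aux_loss)` return value are assumptions based on typical implementations of the paper, not details confirmed by this listing; check the repository's README for the actual interface.

```python
import torch
import torch.nn.functional as F
from moe import MoE  # assumed module and class name

# Hypothetical hyperparameters: 1000-dim inputs, 20-dim outputs,
# 10 experts with hidden size 64, each example routed to its top-4 experts.
model = MoE(input_size=1000, output_size=20, num_experts=10,
            hidden_size=64, noisy_gating=True, k=4)

x = torch.rand(32, 1000)  # a batch of 32 examples
model.train()
y, aux_loss = model(x)    # assumed to return the layer output and a load-balancing loss

# The auxiliary loss is added to the task loss so that routing stays balanced
# across experts (zero target used here as a placeholder, for illustration only).
loss = F.mse_loss(y, torch.zeros_like(y)) + aux_loss
loss.backward()
```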
Issues
- A question for changing input size of moe (#28, opened by jhxu003, 0 comments)
- For def _prob_in_top_k (#31, opened by Brankozz, 1 comment)
- MoE for transformers (#24, opened by elias-ramzi, 10 comments)
- Zero Grad of w_gate (#27, opened by panmianzhi, 0 comments)
- requires_grad = True not required for a variable under combine() method? (#26, opened by doppiomovimento, 4 comments)
- have you ever meet such trend of loss? (#22, opened by lonelyqian, 0 comments)
- How to use this layer in a sequence setting? (#25, opened by agupta54, 0 comments)
- some questions about the code (#23, opened by hanruisong00, 1 comment)
- Why not gpu? (#17, opened by chengjiaxiangbytedance, 4 comments)
- cv_squared (#2, opened by caoshijie0501, 1 comment; see the sketch after this list)
- why apply exp() log() in expert_out result in combine() function of SparseDispatcher class (#19, opened by Zrealshadow, 4 comments)
- regression task self.w_gate is nan (#16, opened by JieDengsc, 5 comments)
- Why logsoftmax in the expert's output? (#13, opened by sofiapvcp, 2 comments)
- multiple_by_gates after exp (#15, opened by yjw1029, 5 comments)
- Issue with gates parameters (#12, opened by elias-ramzi, 3 comments)
- Question about the noisy top-k gating (#11, opened by huangtinglin, 2 comments)
- about aux_loss (#10, opened by enterhuiche, 4 comments)
- Wrong Implementation in SparseDispatcher (#8, opened by Cascol-Chen, 4 comments)
- Examples of using real dataset (#3, opened by GabrielLin, 2 comments)
- Please add license file if open source (#6, opened by yellowlab9, 1 comment)
- Log and Exp- Space (#4, opened by StillerPatrick, 1 comment)
- Tutorial of using Tensorflow version (#1, opened by GabrielLin)
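Several of the issues above (for example #2 on cv_squared, #10 on aux_loss, and #11 on noisy top-k gating) touch on the load-balancing machinery described in the Shazeer et al. paper. The sketch below illustrates the idea only: a hypothetical cv_squared helper computes the squared coefficient of variation of the per-expert importance (the total gate weight each expert receives over a batch), and that quantity serves as an auxiliary loss. Names, shapes, and the 0.01 coefficient are illustrative assumptions, not the repository's exact code.

```python
import torch

def cv_squared(x, eps=1e-10):
    # Squared coefficient of variation: Var(x) / Mean(x)^2.
    # Small when the values in x are spread evenly across experts.
    if x.numel() <= 1:
        return torch.zeros((), dtype=x.dtype, device=x.device)
    return x.var() / (x.mean() ** 2 + eps)

# Hypothetical gate matrix: one row per example, one column per expert;
# in a top-k gate, most entries of each row would be zero.
gates = torch.rand(32, 10)

# "Importance" of an expert = total gate weight routed to it over the batch.
importance = gates.sum(dim=0)

# Auxiliary load-balancing loss, scaled by a small coefficient and added to the
# task loss so that no single expert monopolizes the routing.
aux_loss = 0.01 * cv_squared(importance)
```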