lucidrains/mixture-of-experts
A PyTorch implementation of Sparsely-Gated Mixture of Experts, for massively increasing the parameter count of language models
Python · MIT License
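For context on the interface the issues below refer to: following the repository README, the layer is instantiated as an `MoE` module that returns both the transformed tokens and an auxiliary balancing loss. A minimal sketch; the hyperparameter values here are illustrative, and the README documents further arguments:

```python
import torch
from mixture_of_experts import MoE

# illustrative hyperparameters; see the README for the full argument list
moe = MoE(
    dim = 512,        # token dimension
    num_experts = 16  # number of expert feedforward networks
)

inputs = torch.randn(4, 1024, 512)   # (batch, sequence, dim)
out, aux_loss = moe(inputs)          # out keeps the input shape; add aux_loss to the training loss
```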
Issues
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
#5 opened by mxs30443
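This error means one tensor or submodule stayed on the CPU while the rest of the computation ran on cuda:0. The issue does not include a traceback, so here is only a generic PyTorch sketch of the usual fix, not specific to this repo's internals:

```python
import torch
from torch import nn

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = nn.Linear(512, 512).to(device)     # .to() moves every registered parameter and buffer
x = torch.randn(8, 512, device = device)   # allocate inputs on the same device

out = model(x)   # a CPU tensor meeting CUDA weights here would raise the error above
```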
PEER implementation
#11 opened by huu4ontocord
Load balancing loss?
#10 opened by Aman-Goel1
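The Sparsely-Gated MoE paper (Shazeer et al., 2017) that this repo implements balances experts with a squared coefficient-of-variation penalty on expert importance and load. A minimal sketch of that idea; `cv_squared`, `gates`, `importance`, and `load` are illustrative names, not necessarily this repo's internals:

```python
import torch

def cv_squared(x, eps = 1e-10):
    # squared coefficient of variation: variance / mean**2,
    # small when every expert receives a similar share
    x = x.float()
    return x.var() / (x.mean() ** 2 + eps)

gates = torch.rand(1024, 16).softmax(dim = -1)   # (tokens, experts); with real top-k gating most entries are zero
importance = gates.sum(dim = 0)                  # total gate weight assigned to each expert
load = (gates > 0).float().sum(dim = 0)          # number of tokens routed to each expert
aux_loss = cv_squared(importance) + cv_squared(load)
```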
Would you elaborate more on the enhancement?
#9 opened by yhyu13
convolution operation
#8 opened by Yonsun-w
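The layer operates on (batch, sequence, dim) tokens, so one plausible way to combine it with convolutions is to flatten spatial positions into a sequence before routing. A hypothetical sketch, reusing the `MoE` interface assumed above:

```python
import torch
from mixture_of_experts import MoE   # interface as sketched above

moe = MoE(dim = 512, num_experts = 16)

feats = torch.randn(4, 512, 16, 16)          # (batch, channels, height, width) from a conv layer
b, c, h, w = feats.shape

tokens = feats.flatten(2).transpose(1, 2)    # (batch, h*w, channels): one token per spatial position
out, aux_loss = moe(tokens)                  # route each spatial position through the experts
feats = out.transpose(1, 2).reshape(b, c, h, w)   # restore the feature-map layout
```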
Implicit in-place operation '*=' causes an error when deriving the backward gradient in PyTorch
#6 opened by VRCMF
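Operators like `*=` mutate a tensor in place; if autograd saved that tensor for the backward pass, backpropagation fails. A minimal reproduction and the usual out-of-place fix:

```python
import torch

x = torch.ones(3, requires_grad = True)
y = x.exp()        # exp() saves its output for the backward pass

# y *= 3           # in-place: backward() would raise "one of the variables needed
#                  # for gradient computation has been modified by an inplace operation"

y = y * 3          # out-of-place fix: builds a new tensor instead of mutating y
y.sum().backward()
```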
Error reported under FP16 training
#3 opened by SefaZeng
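The issue title does not say which op failed. A common cause of FP16 failures is casting the whole model to half precision instead of using mixed precision; a generic sketch with `torch.cuda.amp`, not specific to this repo:

```python
import torch
from torch import nn

model = nn.Linear(512, 512).cuda()
opt = torch.optim.SGD(model.parameters(), lr = 1e-3)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    x = torch.randn(8, 512, device = 'cuda')
    with torch.cuda.amp.autocast():   # run ops in fp16 where safe, keep the rest in fp32
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()     # scale the loss so fp16 gradients don't underflow
    scaler.step(opt)
    scaler.update()
    opt.zero_grad()
```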
RuntimeError: expected backend CPU and dtype Float but got backend CPU and dtype Long
#2 opened by littlepan0413
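This error comes from mixing a Float tensor with a Long (integer) tensor in a single op; PyTorch versions of that era did not promote integer tensors automatically. A minimal illustration with hypothetical `gates`/`mask` tensors:

```python
import torch

gates = torch.rand(4, 16)              # float tensor
mask = torch.randint(0, 2, (4, 16))    # integer (Long) tensor

# gates * mask                         # raised the dtype error on older PyTorch versions
out = gates * mask.float()             # cast explicitly so both operands share a dtype
```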
Segmentation Fault?
#1 opened by SungMinCho