Issues
AMP + BF16 failing
#95 opened by jramapuram - 2
Question on offsets in figures 5
#61 opened by DaehanKim - 4
Wrong outputs for hidden dim 14336
#46 opened by pierrestock - 1
Cloning input `x` in `megablocks.layers.glu.SparseGLU` leads to different SDD outputs
#115 opened by cmsflash - 2
_LOAD_BALANCING_LOSS returns empty list sometimes
#113 opened by exnx - 1
Bad throughput with GLU
#110 opened by Muennighoff - 0
1-expert worse than dense model
#107 opened by Muennighoff - 4
Sum missing axis arg in kernels.py
#102 opened by jambo6 - 3
support amd/rocm
#97 opened by ehartford - 3
OSError: Stale file handle with dMoE
#106 opened by Muennighoff - 9
Add a fine-tune script for JetMoE
#105 opened by shamanez - 5
ScatterMoE feature
#104 opened by ehartford - 15
Implement Mixture of Depth and Experts (MoDE)
#103 opened by casper-hansen - 3
Import dmoe model into other training script?
#101 opened by andrewnc - 1
Computation distribution with expert parallelism
#100 opened by opherlieber - 5
Does this framework support SFT?
#90 opened by banksy23 - 15
Has anyone encountered this CUDA error?
#62 opened by bozheng-hit - 0
Unsharding scripts for megablocks models
#94 opened by mayank31398 - 2
the wrong loss func was chosen at evaluation
#93 opened by peterjc123 - 3
Seeking a good multi-node training config
#92 opened by rpand002 - 1
selective router precision
#91 opened by 152334H - 8
Error from pip about missing torch module
#78 opened by michaelwhitford - 3
Docker issues with PyPI installation
#67 opened by sedrick-keh-tri - 6
ParallelDroplessMLP initialises self.mlp twice
#83 opened by 152334H - 4
Gradient scale size for expert gradient
#86 opened by fanshiqing - 2
save loading_balancing_loss properly
#82 opened by gouchangjiang - 1
How to integrate to transformers-based mixtral
#84 opened by nxphi47 - 1
Why the second matrix of the mlp layer has the same shape of the first one?
#81 opened by gouchangjiang - 1
[BUG] Optimizer Weights Not Reloaded When Training with bf16 Pretrained Weights
#80 opened by RookieHong - 4
Comparison against top-2 routing?
#49 opened by sunnyszy - 1
Script for Full Fine-Tuning of Mixtral
#68 opened by alpayariyak - 2
Efficiency of torch mlp
#77 opened by imoneoi - 5
How to add support for swiglu in Megablocks?
#35 opened by fanshiqing - 4
About the Multi-node Script
#59 opened by XingyuXie - 5
Inference code
#48 opened by AlpinDale - 2
How to pip install the latest megablocks?
#32 opened by fanshiqing - 2
Why not support tensor model parallel?
#40 opened by Richie-yan - 5
multi-node problem
#18 opened by sudahui - 2