Find my notes on each of the following papers in the markdown file:
- Mixtral of Experts
- Scalable and Efficient MoE Training for Multitask Multilingual Models
- ST-MoE
- Unified Scaling Laws for Routed Language Models
- Switch Transformers
- Limits of Transfer Learning with a Unified Text-to-Text Transformer
- Empirical Understanding of MoE Design Choices
- Outrageously Large Neural Networks
- MoE Cross-Example Aggregation
- Transferable Adversarial Robustness for Categorical Data via Universal Robust Embedding
- Hash Layers For Large Sparse Models
- How to Train Your HiPPO: State Space Models with Generalized Orthogonal Basis Projections
- Combining Recurrent, Convolutional, and Continuous-time Models with Linear State Space Layers
- Mamba: Linear-Time Sequence Modeling with Selective State Spaces
- Vision Mamba
- Repeat After Me: Transformers are Better than State Space Models at Copying
- Efficiently Modeling Long Sequences with Structured State Spaces
- Hungry Hungry Hippos: Towards Language Modeling with State Space Models
- MambaByte
- Legendre Memory Units
- HiPPO Recurrent Memory