voltronformers

Assembling the best SotA AI techniques into a unified model

License: Apache-2.0

- 13B-parameter BitNet + Infini-Attention + DenseFormer + MoD +
  In-Context Pretraining + two-stage pretraining
- upcycle with c-BTX (c-BTM clustering + BTX merging) into an 8-expert sparse MoE + MoA
  (a hypothetical config sketch follows below)

https://twitter.com/winglian/status/1778675583817326842
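
To make the plan concrete, here is a hypothetical configuration sketch of how the pieces might compose. All class and field names (e.g. `VoltronformerConfig`, `mod_capacity`) and the specific values are illustrative assumptions, not code from this repo.

```python
from dataclasses import dataclass

# Hypothetical configuration sketch for the planned model.
# Field names and values are illustrative assumptions, not project code.
@dataclass
class VoltronformerConfig:
    # Dense pretraining backbone (roughly 13B parameters)
    hidden_size: int = 5120
    num_layers: int = 40
    num_attention_heads: int = 40
    vocab_size: int = 32000

    # Technique toggles mirroring the plan above
    bitnet_quantized_linear: bool = True      # BitNet: 1-bit weight linear layers
    infini_attention: bool = True             # Infini-Attention: compressive long-context memory
    denseformer_dwa: bool = True              # DenseFormer: depth-weighted averaging across blocks
    mixture_of_depths: bool = True            # MoD: per-block token routing
    mod_capacity: float = 0.125               # fraction of tokens each MoD block processes

    # Data / schedule
    in_context_pretraining: bool = True       # order related docs into the same context window
    two_stage_pretraining: bool = True        # MiniCPM-style stable stage + decay stage

    # Upcycling target
    moe_num_experts: int = 8                  # sparse MoE built via c-BTM clustering + BTX merging
    mixture_of_attention_heads: bool = True   # MoA routing over attention-head experts
```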

References

BitNet

BitNet: Scaling 1-bit Transformers for Large Language Models
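
A minimal sketch of the core BitNet idea: a linear layer whose weights are binarized with a per-tensor scale and trained with a straight-through estimator. The paper's activation quantization and SubLN are omitted, and `BitLinear` here is an illustration, not this repo's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinear(nn.Linear):
    """Illustrative BitNet-style linear layer: 1-bit weights with a per-tensor scale."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        scale = w.abs().mean()                # per-tensor scaling factor
        w_bin = torch.sign(w) * scale         # binarize weights to {-scale, +scale}
        # Straight-through estimator: the forward pass uses the binarized weights,
        # gradients flow to the latent full-precision weights.
        w_q = w + (w_bin - w).detach()
        return F.linear(x, w_q, self.bias)

# Usage: swap nn.Linear for BitLinear inside the transformer blocks.
layer = BitLinear(512, 512)
y = layer(torch.randn(2, 16, 512))
```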

DenseFormer

DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging
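
A sketch of depth-weighted averaging (DWA) as described in the paper: each block's input is a learned weighted average of the embedding and all earlier block outputs. The wrapper class below is an illustrative assumption, not the paper's reference code.

```python
import torch
import torch.nn as nn

class DWATransformer(nn.Module):
    """Illustrative DenseFormer: depth-weighted averaging (DWA) after every block."""

    def __init__(self, blocks: nn.ModuleList):
        super().__init__()
        self.blocks = blocks
        # The weight vector at depth i mixes the embedding plus the first i+1 block outputs.
        self.dwa_weights = nn.ParameterList(
            [nn.Parameter(torch.zeros(i + 2)) for i in range(len(blocks))]
        )
        with torch.no_grad():
            for w in self.dwa_weights:
                w[-1] = 1.0   # start as the identity mixture (use only the newest output)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        block_outputs = [x]   # X_0: embedding output
        y = x                 # Y_0
        for block, w in zip(self.blocks, self.dwa_weights):
            block_outputs.append(block(y))                    # X_i = Block_i(Y_{i-1})
            stacked = torch.stack(block_outputs, dim=0)       # (i+2, B, T, H)
            y = (w.view(-1, 1, 1, 1) * stacked).sum(dim=0)    # Y_i: weighted average of X_0..X_i
        return y
```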

Mixture-of-Depths

Mixture-of-Depths: Dynamically allocating compute in transformer-based language models
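
A sketch of the Mixture-of-Depths routing idea: a per-block router scores tokens, only the top-k (a `capacity` fraction of the sequence) pass through the block, and the rest are carried forward unchanged. The wrapper below is illustrative and simplifies the paper's routing details.

```python
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    """Illustrative Mixture-of-Depths wrapper: only the top-k tokens go through the block."""

    def __init__(self, block: nn.Module, hidden_size: int, capacity: float = 0.125):
        super().__init__()
        self.block = block                      # any (B, T, H) -> (B, T, H) transformer block
        self.router = nn.Linear(hidden_size, 1)
        self.capacity = capacity

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bsz, seq_len, hidden = x.shape
        k = max(1, int(seq_len * self.capacity))
        scores = self.router(x).squeeze(-1)                        # (B, T) routing scores
        topk = scores.topk(k, dim=-1).indices.sort(dim=-1).values  # keep selected tokens in order
        idx = topk.unsqueeze(-1).expand(-1, -1, hidden)            # (B, k, H) gather/scatter index
        selected = torch.gather(x, 1, idx)
        processed = self.block(selected)                           # heavy compute on k tokens only
        gate = torch.gather(scores, 1, topk).unsqueeze(-1).sigmoid()
        out = x.clone()                                            # unselected tokens pass through
        out.scatter_(1, idx, selected + gate * (processed - selected))
        return out
```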

In-Context Pretraining

In-Context Pretraining: Language Modeling Beyond Document Boundaries
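
A toy sketch of the document-ordering idea: pack semantically related documents into the same context window by chaining nearest neighbors over document embeddings. The real method uses approximate nearest-neighbor search and a graph traversal at corpus scale, so treat this greedy version as illustrative only.

```python
import numpy as np

def order_documents_by_similarity(doc_embeddings: np.ndarray) -> list[int]:
    """Greedy nearest-neighbor ordering so related documents share a context window."""
    n = doc_embeddings.shape[0]
    normed = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T                      # cosine similarity between documents
    visited = np.zeros(n, dtype=bool)
    order = [0]
    visited[0] = True
    for _ in range(n - 1):
        candidates = np.where(visited, -np.inf, sims[order[-1]])
        nxt = int(candidates.argmax())            # closest not-yet-used document
        order.append(nxt)
        visited[nxt] = True
    return order

# Concatenate documents in this order when packing pretraining sequences.
```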

MiniCPM (Two Stage Pre-training Strategy)

MiniCPM: Unveiling the Potential of Small Language Models with Scalable Training Strategies
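
A sketch of the Warmup-Stable-Decay (WSD) schedule behind the two-stage recipe: a long constant-LR stable stage on general pretraining data, then a short decay stage during which higher-quality / SFT-style data is mixed in. The linear anneal below is a simplification of the paper's decay shape; the function name is an assumption.

```python
def wsd_lr(step: int, max_lr: float, warmup_steps: int,
           stable_steps: int, decay_steps: int, min_lr: float = 0.0) -> float:
    """Illustrative Warmup-Stable-Decay learning-rate schedule.

    Stage 1 ("stable"): general pretraining data at a constant max_lr after warmup.
    Stage 2 ("decay"): higher-quality / SFT-style data mixed in while the LR anneals.
    """
    if step < warmup_steps:
        return max_lr * step / max(1, warmup_steps)
    if step < warmup_steps + stable_steps:
        return max_lr
    progress = min(1.0, (step - warmup_steps - stable_steps) / max(1, decay_steps))
    return max_lr - (max_lr - min_lr) * progress   # linear anneal; the paper decays faster
```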

Cluster-Branch-Train-Merge (c-BTM)

Scaling Expert Language Models with Unsupervised Domain Discovery
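
A sketch of c-BTM's inference-time routing: the corpus is clustered (e.g. k-means over document embeddings), one expert LM is trained per cluster, and at inference the experts are ensembled with weights based on the context's distance to each cluster center. The function below is an illustrative approximation, not the paper's exact formulation.

```python
import numpy as np

def cbtm_expert_weights(context_embedding: np.ndarray,
                        cluster_centers: np.ndarray,
                        temperature: float = 0.1,
                        top_k: int = 2) -> np.ndarray:
    """Weight the per-cluster experts by how close the context is to each cluster center."""
    dists = ((cluster_centers - context_embedding) ** 2).sum(axis=1)   # distance per cluster
    weights = np.exp(-dists / temperature)
    cutoff = np.sort(weights)[-top_k]                                  # sparsify to top-k experts
    weights = np.where(weights >= cutoff, weights, 0.0)
    return weights / weights.sum()

# Next-token distribution = sum_i weights[i] * expert_i(context); clusters come from
# k-means over document embeddings, with one expert LM trained per cluster.
```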

Branch-Train-MiX (BTX)

Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM
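
A sketch of the BTX merge step: the branched experts' feed-forward weights become the experts of an MoE layer (with a router trained afterwards), while all other parameters are averaged. The `"mlp"` naming convention below is an assumption about parameter names, not a real checkpoint layout.

```python
import torch

def btx_merge(expert_state_dicts: list[dict]) -> dict:
    """Merge branched dense experts into one MoE checkpoint, BTX-style."""
    merged = {}
    for name in expert_state_dicts[0]:
        tensors = [sd[name] for sd in expert_state_dicts]
        if "mlp" in name:
            # Feed-forward weights stay separate: one MoE expert per branched model.
            for i, t in enumerate(tensors):
                merged[name.replace("mlp", f"mlp.experts.{i}", 1)] = t
        else:
            # Attention, norms, embeddings: averaged across the branched experts.
            merged[name] = torch.stack(tensors).mean(dim=0)
    return merged

# A router is then added per MoE layer and the merged model is finetuned.
```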

Mixture Of Attention Heads

Mixture of Attention Heads: Selecting Attention Heads Per Token
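
A sketch of a Mixture-of-Attention-heads layer: a router assigns each token to its top-k attention "experts", which share key/value projections but have their own query and output projections. For clarity this version evaluates all experts densely and omits causal masking; the class is illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MoASelfAttention(nn.Module):
    """Illustrative Mixture-of-Attention-heads: tokens are routed to top-k attention experts."""

    def __init__(self, hidden: int, head_dim: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.k_proj = nn.Linear(hidden, head_dim)   # keys/values are shared across experts
        self.v_proj = nn.Linear(hidden, head_dim)
        self.q_proj = nn.Parameter(torch.randn(num_experts, hidden, head_dim) * 0.02)
        self.o_proj = nn.Parameter(torch.randn(num_experts, head_dim, hidden) * 0.02)
        self.router = nn.Linear(hidden, num_experts)
        self.top_k, self.scale = top_k, head_dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        probs = self.router(x).softmax(dim=-1)                        # (B, T, E)
        topk_p, topk_i = probs.topk(self.top_k, dim=-1)
        gates = torch.zeros_like(probs).scatter(-1, topk_i, topk_p)   # keep only top-k experts
        gates = gates / gates.sum(dim=-1, keepdim=True)

        k, v = self.k_proj(x), self.v_proj(x)                         # (B, S, D) shared K/V
        q = torch.einsum("bth,ehd->betd", x, self.q_proj)             # per-expert queries
        attn = torch.einsum("betd,bsd->bets", q, k) * self.scale
        attn = attn.softmax(dim=-1)                                   # causal masking omitted
        ctx = torch.einsum("bets,bsd->betd", attn, v)
        expert_out = torch.einsum("betd,edh->beth", ctx, self.o_proj)
        # Per-token weighted sum of the selected experts' outputs.
        return torch.einsum("beth,bte->bth", expert_out, gates)
```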