# Mixture of Depths Scaling

Implementation of the paper: "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models". From the paper: "These models match baseline performance for equivalent FLOPS and wall-clock times to train, but require a fraction of the FLOPs per forward pass, and can be upwards of 50% faster to step during post-training sampling."

# Install

pip3 install mixture-of-depths

# Usage
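
This README does not document the package's API, so the sketch below does not import `mixture-of-depths`. It is a minimal, dependency-free illustration of the routing idea from the paper, as best understood here: a router scores every token, only the top-k tokens (a fixed capacity) go through the block, and the rest skip it via the residual path. The names `route_top_k`, `mod_block`, `block_fn`, `router_weights`, and `capacity` are hypothetical and chosen only for this example. A real implementation would likely use a tensor library rather than Python lists.

```python
import math


def route_top_k(scores, capacity):
    """Return the indices of the top-k scoring tokens, in sequence order."""
    k = max(1, math.ceil(capacity * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def mod_block(tokens, router_weights, block_fn, capacity=0.5):
    """Apply block_fn to the routed tokens only (hypothetical helper).

    tokens:         list of token vectors (lists of floats)
    router_weights: one weight per feature; the router is a dot product
    block_fn:       takes the selected token vectors, returns one update each
    capacity:       fraction of tokens allowed through the block
    """
    # One scalar router score per token.
    scores = [sum(w * x for w, x in zip(router_weights, tok)) for tok in tokens]
    selected = route_top_k(scores, capacity)

    # Tokens that are not selected pass through unchanged (residual path).
    out = [list(tok) for tok in tokens]

    # The block only sees the selected tokens, which is where compute is saved.
    updates = block_fn([tokens[i] for i in selected])
    for i, update in zip(selected, updates):
        # Scale the update by the router score before adding the residual.
        out[i] = [x + scores[i] * u for x, u in zip(tokens[i], update)]
    return out, selected


# Toy run: 4 tokens, 2 features, half of the tokens are routed.
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 3.0]]
router_weights = [1.0, 0.0]


def add_ones(selected_tokens):
    # Stand-in for an attention + MLP block: a constant update per feature.
    return [[1.0 for _ in tok] for tok in selected_tokens]


out, selected = mod_block(tokens, router_weights, add_ones, capacity=0.5)
print(selected)
print(out)
```

With `capacity=0.5`, the block processes two of the four tokens, and the other two are copied through untouched. Lowering `capacity` shrinks the amount of work the block does per forward pass, which is the trade-off the paper's quoted claim refers to.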


# License
MIT