Implementation of the paper: "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models". From the paper: "These models match baseline performance for equivalent FLOPS and wall-clock times to train, but require a fraction of the FLOPs per forward pass, and can be upwards of 50% faster to step during post-training sampling."
# Install

```bash
pip3 install mixture-of-depths
```
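To illustrate the idea the paper describes (independent of this package's actual API), below is a minimal PyTorch sketch of expert-choice top-k token routing around a transformer block. `MoDBlock`, the `capacity` fraction, and the sigmoid gate are illustrative assumptions, not this library's interface.

```python
# Hypothetical sketch of Mixture-of-Depths token routing (not this package's API):
# a router scores every token and only the top-k tokens per sequence pass through
# the wrapped block; the remaining tokens skip it via the residual path.
import torch
import torch.nn as nn

class MoDBlock(nn.Module):
    def __init__(self, block: nn.Module, dim: int, capacity: float = 0.125):
        super().__init__()
        self.block = block               # any module mapping (B, T, D) -> (B, T, D)
        self.router = nn.Linear(dim, 1)  # scalar routing score per token
        self.capacity = capacity         # fraction of tokens routed through the block

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, T, D = x.shape
        k = max(1, int(self.capacity * T))
        scores = self.router(x).squeeze(-1)          # (B, T) router logits
        topk = scores.topk(k, dim=-1).indices        # (B, k) tokens to process
        idx = topk.unsqueeze(-1).expand(-1, -1, D)   # (B, k, D) gather/scatter index
        selected = x.gather(1, idx)                  # routed tokens only
        processed = self.block(selected)             # heavy compute on k tokens, not T
        # Weight the block output by the (sigmoid) router score so the routing
        # decision stays differentiable, then scatter the results back; unrouted
        # tokens pass through unchanged.
        gate = torch.sigmoid(scores.gather(1, topk)).unsqueeze(-1)
        return x.scatter(1, idx, selected + gate * processed)

# Example: route ~12.5% of tokens through a standard transformer encoder layer.
block = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
layer = MoDBlock(block, dim=512, capacity=0.125)
out = layer(torch.randn(2, 64, 512))                 # (2, 64, 512)
```

The FLOP savings come from running the wrapped block on only `capacity * T` tokens per sequence, which is how the paper trades per-token compute for a smaller per-forward-pass budget.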
# License
MIT