fast_mamba.np
is a pure and efficient NumPy implementation of Mamba, featuring cache support. This code preserves Mamba's native caching capability while maintaining a simple and clear implementation. Caching avoids recomputing previous tokens, yielding a speedup of up to 4x for 100 generated tokens compared to mamba.np.
$ python fast_mamba.py "I have a dream that"
"""
I have a dream that I will be able to see the sunrise in the morning.
Token count: 18, elapsed: 9.65s, 1.9 tokens/s
"""
If you use or discuss fast_mamba.np in your academic research, please cite the project to help spread awareness:
@misc{fast_mamba.np,
  title = {fast_mamba.np: pure NumPy implementation for Mamba},
  author = {Ido Hakimi},
  howpublished = {\url{https://github.com/idoh/fast_mamba.np}},
  year = {2024},
  note = {fast_mamba.np, MIT License}
}
Thanks to the creators and contributors of the following libraries and tools:
- mamba-minimal - @johnma2006
- llama3.np - @likejazz
- The Mamba architecture was introduced in Mamba: Linear-Time Sequence Modeling with Selective State Spaces by Albert Gu and Tri Dao
- The official implementation is here: https://github.com/state-spaces/mamba
- Title image was generated by Microsoft Designer and edited with Canva Photo Editor