MInference

MInference speeds up long-context LLM inference by approximating attention with dynamic sparse computation, reducing pre-filling latency by up to 10x on a single A100 while maintaining accuracy.

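The dynamic sparse idea can be illustrated with a short, self-contained toy sketch (this is an assumption-laden illustration, not MInference's actual kernels, sparse patterns, or API): pooled query/key blocks give a cheap estimate of which key blocks matter for each query block, and exact attention is then computed only over those selected blocks.

```python
# Illustrative block-level dynamic sparse attention sketch (NOT MInference's
# actual method): score block pairs cheaply, keep the top-k key blocks per
# query block, and run exact attention only on the kept blocks.
import torch
import torch.nn.functional as F

def block_sparse_attention(q, k, v, block_size=64, top_k_blocks=4):
    """q, k, v: [seq_len, head_dim]; seq_len assumed divisible by block_size."""
    seq_len, dim = q.shape
    n_blocks = seq_len // block_size

    # Cheap proxy: mean-pool each block and score query-block vs. key-block pairs.
    q_pooled = q.view(n_blocks, block_size, dim).mean(dim=1)   # [n_blocks, dim]
    k_pooled = k.view(n_blocks, block_size, dim).mean(dim=1)   # [n_blocks, dim]
    block_scores = q_pooled @ k_pooled.T                       # [n_blocks, n_blocks]
    keep = block_scores.topk(min(top_k_blocks, n_blocks), dim=-1).indices

    out = torch.zeros_like(q)
    for qb in range(n_blocks):
        q_rows = q[qb * block_size:(qb + 1) * block_size]      # [block_size, dim]
        # Gather only the selected key/value rows for this query block.
        cols = torch.cat([torch.arange(kb * block_size, (kb + 1) * block_size)
                          for kb in keep[qb].tolist()])
        attn = F.softmax(q_rows @ k[cols].T / dim ** 0.5, dim=-1)
        out[qb * block_size:(qb + 1) * block_size] = attn @ v[cols]
    return out

# Toy usage: far less work than full attention when top_k_blocks << n_blocks.
q, k, v = (torch.randn(1024, 64) for _ in range(3))
out = block_sparse_attention(q, k, v)
```

In the real system the sparse pattern is chosen dynamically per head and the sparse attention runs in optimized kernels; the sketch above only conveys the compute-skipping idea.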
Primary language: Python. License: MIT.

Pinned issues

[ToDo]: V0.1.6 Iteration Plan (#50, opened by iofu728) - Open

[ToDo]: V0.1.5 Iteration Plan (#27, opened by iofu728) - Closed

Issues