
Atoma Paged Attention

A collection of Large Language Models (LLMs) whose KV cache memory management is optimized through Paged Attention (see the PagedAttention paper by Kwon et al.). Paged Attention enables efficient inference serving, which is crucial for Atoma nodes.
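
The core idea is easiest to see in code. Below is a minimal, self-contained Rust sketch (illustrative only, not this repository's actual API) of the bookkeeping behind Paged Attention: the KV cache is carved into fixed-size physical blocks, and each sequence keeps a block table mapping its logical blocks to physical ones, so memory is allocated on demand rather than reserved up front for the maximum sequence length. Names such as `BlockAllocator` and `BLOCK_SIZE` are hypothetical.

```rust
// Tokens per KV-cache block (illustrative; real kernels pick this to match
// the GPU memory layout).
const BLOCK_SIZE: usize = 16;

/// Pool of physical KV-cache blocks shared by all sequences.
struct BlockAllocator {
    free_blocks: Vec<usize>,
}

impl BlockAllocator {
    fn new(num_blocks: usize) -> Self {
        Self { free_blocks: (0..num_blocks).collect() }
    }

    /// Hand out one physical block, if any remain.
    fn allocate(&mut self) -> Option<usize> {
        self.free_blocks.pop()
    }
}

/// Per-sequence block table: logical block `i` lives at `block_table[i]`.
struct Sequence {
    num_tokens: usize,
    block_table: Vec<usize>,
}

impl Sequence {
    fn new() -> Self {
        Self { num_tokens: 0, block_table: Vec::new() }
    }

    /// Append one token, allocating a new physical block only when the
    /// current one fills up — this is what avoids over-reserving memory.
    fn append_token(&mut self, alloc: &mut BlockAllocator) -> Option<()> {
        if self.num_tokens % BLOCK_SIZE == 0 {
            self.block_table.push(alloc.allocate()?);
        }
        self.num_tokens += 1;
        Some(())
    }

    /// Map a token position to (physical block, offset within block); this
    /// is the lookup an attention kernel performs to find the K/V vectors.
    fn locate(&self, pos: usize) -> (usize, usize) {
        (self.block_table[pos / BLOCK_SIZE], pos % BLOCK_SIZE)
    }
}

fn main() {
    let mut alloc = BlockAllocator::new(64);
    let mut seq = Sequence::new();
    for _ in 0..20 {
        seq.append_token(&mut alloc).expect("KV-cache pool exhausted");
    }
    // Token 17 sits at offset 1 of the sequence's second physical block.
    println!("token 17 -> {:?}", seq.locate(17));
}
```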

Integration with Candle

Our infrastructure integrates with HuggingFace's candle, an ML framework written entirely in Rust. Candle lets us rely on Rust's performance and memory safety, and it simplifies integrating machine learning pipelines into distributed AI inference systems. For these reasons, we believe this repository can be of great value to both the ML and Rust communities.
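
As a taste of what working with candle looks like, here is a minimal sketch of single-head scaled dot-product attention built on `candle-core` and `candle-nn`. It is illustrative only (a naive contiguous-cache computation, not this repository's paged kernels), and assumes candle versions where `Tensor::randn`, `Tensor::matmul`, and `candle_nn::ops::softmax` are available.

```rust
use candle_core::{Device, Result, Tensor};

/// Naive single-head scaled dot-product attention: softmax(Q Kᵀ / √d) V.
fn attention(q: &Tensor, k: &Tensor, v: &Tensor) -> Result<Tensor> {
    let head_dim = q.dim(1)? as f64; // q, k, v: (seq_len, head_dim)
    let scores = (q.matmul(&k.t()?)? / head_dim.sqrt())?; // (seq, seq)
    let weights = candle_nn::ops::softmax(&scores, 1)?; // normalize over keys
    weights.matmul(v) // (seq_len, head_dim)
}

fn main() -> Result<()> {
    let device = Device::Cpu; // or Device::new_cuda(0)? on a CUDA build
    let q = Tensor::randn(0f32, 1f32, (8, 64), &device)?;
    let k = Tensor::randn(0f32, 1f32, (8, 64), &device)?;
    let v = Tensor::randn(0f32, 1f32, (8, 64), &device)?;
    let out = attention(&q, &k, &v)?;
    println!("attention output shape: {:?}", out.dims());
    Ok(())
}
```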