xzyaoi's Stars
open-webui/open-webui
User-friendly WebUI for AI (Formerly Ollama WebUI)
microsoft/graphrag
A modular graph-based Retrieval-Augmented Generation (RAG) system
lm-sys/RouteLLM
A framework for serving and evaluating LLM routers - save LLM costs without compromising quality!
pytorch/torchtitan
A native PyTorch Library for large model training
nschloe/tuna
:fish: Python profile viewer
kvcache-ai/Mooncake
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
microsoft/nnfusion
A flexible and efficient deep neural network (DNN) compiler that generates high-performance executables from a DNN model description.
microsoft/MInference
Speeds up long-context LLM inference by computing attention with approximate, dynamic sparse methods, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
forhaoliu/ringattention
Transformers with Arbitrarily Large Context
tpoisonooo/how-to-optimize-gemm
row-major matmul optimization
AmberLJC/LLMSys-PaperList
Large Language Model (LLM) Systems Paper List
NVlabs/DoRA
[ICML2024 (Oral)] Official PyTorch implementation of DoRA: Weight-Decomposed Low-Rank Adaptation
NVIDIA/multi-gpu-programming-models
Examples demonstrating available options to program multiple GPUs in a single node or a cluster
efeslab/Nanoflow
A throughput-oriented high-performance serving framework for LLMs
xdit-project/xDiT
xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) on multi-GPU Clusters
leanstore/leanstore
kjk/edna
Note taking for developers and power users
microsoft/TransformerCompression
For releasing code related to compression methods for transformers, accompanying our publications
feifeibear/long-context-attention
USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long-Context Transformer Model Training and Inference
te42kyfo/gpu-benches
collection of benchmarks to measure basic GPU capabilities
ServerlessLLM/ServerlessLLM
Scalable and Efficient Serverless Deployment for Large AI Models.
microsoft/sarathi-serve
A low-latency & high-throughput serving engine for LLMs
wzsh/wmma_tensorcore_sample
Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Cores)
lucidrains/PEER-pytorch
PyTorch implementation of the PEER block from the paper Mixture of A Million Experts, by Xu Owen He at DeepMind
HandH1998/QQQ
QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs.
infinigence/LVEval
Repository of LV-Eval Benchmark
shadowpa0327/Palu
Code for Palu: Compressing KV-Cache with Low-Rank Projection
vmarinowski/infini-attention
An unofficial PyTorch implementation of 'Efficient Infinite Context Transformers with Infini-attention'
unixpickle/learn-ptx
Learning about CUDA by writing PTX code.
CARV-ICS-FORTH/HPK
HPK allows running Kubernetes applications within HPC by translating deployments to Slurm and Singularity/Apptainer