xzyaoi's Stars
open-webui/open-webui
User-friendly WebUI for AI (Formerly Ollama WebUI)
microsoft/graphrag
A modular graph-based Retrieval-Augmented Generation (RAG) system
lm-sys/RouteLLM
A framework for serving and evaluating LLM routers - save LLM costs without compromising quality!
pytorch/torchtitan
A native PyTorch Library for large model training
nschloe/tuna
:fish: Python profile viewer
kvcache-ai/Mooncake
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
microsoft/nnfusion
A flexible and efficient deep neural network (DNN) compiler that generates high-performance executables from a DNN model description.
microsoft/MInference
Speeds up long-context LLM inference by computing attention with approximate, dynamic sparse methods, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
forhaoliu/ringattention
Transformers with Arbitrarily Large Context
tpoisonooo/how-to-optimize-gemm
row-major matmul optimization
AmberLJC/LLMSys-PaperList
Large Language Model (LLM) Systems Paper List
NVlabs/DoRA
[ICML2024 (Oral)] Official PyTorch implementation of DoRA: Weight-Decomposed Low-Rank Adaptation
NVIDIA/multi-gpu-programming-models
Examples demonstrating available options to program multiple GPUs in a single node or a cluster
efeslab/Nanoflow
A throughput-oriented high-performance serving framework for LLMs
xdit-project/xDiT
xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) on multi-GPU Clusters
leanstore/leanstore
kjk/edna
Note taking for developers and power users
microsoft/TransformerCompression
For releasing code related to compression methods for transformers, accompanying our publications
feifeibear/long-context-attention
USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long-Context Transformer Model Training and Inference
te42kyfo/gpu-benches
collection of benchmarks to measure basic GPU capabilities
ServerlessLLM/ServerlessLLM
Scalable and Efficient Serverless Deployment for Large AI Models.
microsoft/sarathi-serve
A low-latency & high-throughput serving engine for LLMs
wzsh/wmma_tensorcore_sample
Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Cores)
lucidrains/PEER-pytorch
PyTorch implementation of the PEER block from the paper Mixture of A Million Experts, by Xu Owen He at DeepMind
HandH1998/QQQ
QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs.
infinigence/LVEval
Repository of LV-Eval Benchmark
shadowpa0327/Palu
Code for Palu: Compressing KV-Cache with Low-Rank Projection
vmarinowski/infini-attention
An unofficial PyTorch implementation of 'Efficient Infinite Context Transformers with Infini-attention'
unixpickle/learn-ptx
Learning about CUDA by writing PTX code.
CARV-ICS-FORTH/HPK
HPK allows running Kubernetes applications within HPC by translating deployments to Slurm and Singularity/Apptainer