leliyliu's Stars
UbiquitousLearning/mllm
Fast Multimodal LLM on Mobile Devices
casys-kaist/LLMServingSim
LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale
DefTruth/CUDA-Learn-Notes
📚Modern CUDA Learn Notes with PyTorch: Tensor/CUDA Cores, 📖150+ CUDA Kernels with PyTorch bindings, 📖HGEMM/SGEMM (95%~99% cuBLAS performance), 📖100+ LLM/CUDA Blogs.
zhentingqi/rStar
agiresearch/AIOS
AIOS: LLM Agent Operating System
Jason-cs18/HetServe-LLMs
An Overview of Efficiently Serving Large Language Models across Edge Devices
intelligent-machine-learning/glake
GLake: optimizing GPU memory management and IO transmission.
sgl-project/sglang
SGLang is a fast serving framework for large language models and vision language models.
onejune2018/Awesome-LLM-Eval
Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for evaluation of LLMs. A curated list of tools, benchmarks/datasets, demos, leaderboards, and large models, focused on foundation-model evaluation and aimed at exploring the technical frontiers of generative AI.
AIoT-MLSys-Lab/Efficient-LLMs-Survey
[TMLR 2024] Efficient Large Language Models: A Survey
ggerganov/ggml
Tensor library for machine learning
KnowingNothing/compiler-and-arch
A list of tutorials, papers, talks, and open-source projects on emerging compilers and architectures
metame-ai/awesome-llm-plaza
Awesome LLM plaza: daily tracking of all sorts of awesome LLM topics, e.g. LLMs for coding, robotics, reasoning, multimodality, etc.
DefTruth/Awesome-LLM-Inference
📖A curated list of Awesome LLM Inference Paper with codes, TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention etc.
sramshetty/mixture-of-depths
An unofficial implementation of "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models"
Efficient-ML/Awesome-Efficient-LLM-Diffusion
A list of papers, docs, and code about efficient AIGC. This repo aims to provide information for efficient AIGC research, covering both language and vision, and is continuously being improved. PRs for works (papers, repositories) the repo has missed are welcome.
casys-kaist/NeuPIMs
NeuPIMs Simulator
scale-snu/attacc_simulator
PSAL-POSTECH/ONNXim
ONNXim is a fast cycle-level simulator that can model multi-core NPUs for DNN inference
CMU-SAFARI/ramulator2
Ramulator 2.0 is a modern, modular, extensible, and fast cycle-accurate DRAM simulator. It provides support for agile implementation and evaluation of new memory system designs (e.g., new DRAM standards, emerging RowHammer mitigation techniques). Described in our paper https://people.inf.ethz.ch/omutlu/pub/Ramulator2_arxiv23.pdf
hahnyuan/LLM-Viewer
Analyze the inference of Large Language Models (LLMs). Examines aspects like computation, storage, transmission, and the hardware roofline model in a user-friendly interface.
AmberLJC/LLMSys-PaperList
Large Language Model (LLM) Systems Paper List
AmadeusChan/Awesome-LLM-System-Papers
BBuf/how-to-optim-algorithm-in-cuda
How to optimize some algorithms in CUDA.
flashinfer-ai/flashinfer
FlashInfer: Kernel Library for LLM Serving
MARD1NO/CUDA-PPT
ptillet/torch-blocksparse
Block-sparse primitives for PyTorch
hpc-ulisboa/NDPmulator
A Full-System Framework for Simulating NDP devices from Caches to DRAM
ptillet/triton
Development repository for the Triton language and compiler
huggingface/pytorch_block_sparse
Fast block-sparse matrices for PyTorch