amo33's Stars
ggerganov/llama.cpp
LLM inference in C/C++
YavorGIvanov/sam.cpp
wzsh/wmma_tensorcore_sample
Matrix Multiply-Accumulate with CUDA and WMMA (Tensor Cores)
mit-han-lab/TinyChatEngine
TinyChatEngine: On-Device LLM Inference Library
DefTruth/Awesome-LLM-Inference
📖A curated list of Awesome LLM Inference Paper with codes, TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, Continuous Batching, FlashAttention, PagedAttention etc.
pytorch/PiPPy
Pipeline Parallelism for PyTorch
pytorch-labs/applied-ai
Applied AI experiments and examples for PyTorch
htqin/awesome-model-quantization
A list of papers, docs, and code about model quantization. This repo aims to provide information for model quantization research and is continuously improved. PRs adding works (papers, repositories) the repo has missed are welcome.
carlushuang/cpu_gemm_opt
How to design a CPU GEMM on x86 with 256-bit AVX that can beat OpenBLAS.
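The core idea behind a fast CPU GEMM, before any AVX intrinsics, is blocking: partition the matrices into tiles that fit in cache so each loaded element is reused many times. A minimal sketch of that blocking structure in NumPy (the function name and tile size are illustrative, not from the repo):

```python
import numpy as np

def blocked_gemm(A, B, tile=64):
    """Cache-blocked matrix multiply: C = A @ B.

    Tiling keeps a tile x tile working set of A, B, and C hot in
    cache; optimized GEMMs apply the same blocking again at the
    register level inside a vectorized (e.g. AVX) micro-kernel.
    """
    M, K = A.shape
    K2, N = B.shape
    assert K == K2, "inner dimensions must match"
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for k in range(0, K, tile):
            for j in range(0, N, tile):
                # accumulate one output tile from one pair of input tiles
                C[i:i+tile, j:j+tile] += (
                    A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
                )
    return C
```

The loop order (i, k, j) keeps the A tile resident across the inner loop; real implementations also pack tiles into contiguous buffers before the micro-kernel runs.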
leokruglikov/CUDA-notes
Personal notes on CUDA programming
mortennobel/cpp-cheatsheet
Modern C++ Cheatsheet
BobMcDear/attorch
A subset of PyTorch's neural network modules, written in Python using OpenAI's Triton.
dair-ai/ML-Papers-of-the-Week
🔥Highlighting the top ML papers every week.
sail-sg/zero-bubble-pipeline-parallelism
Zero Bubble Pipeline Parallelism
9rum/flatflow
A learned system for parallel training of neural networks
galeselee/Awesome_LLM_System-PaperList
Since the emergence of ChatGPT in 2022, accelerating large language models has become increasingly important. This is a list of papers on LLM acceleration, currently focused mainly on inference; related work will be added over time. Contributions welcome!
Dao-AILab/causal-conv1d
Causal depthwise conv1d in CUDA, with a PyTorch interface
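"Causal depthwise" means each channel gets its own filter and the input is padded only on the left, so output position t never sees future samples. A reference sketch in NumPy (not the repo's CUDA kernel; names are illustrative):

```python
import numpy as np

def causal_depthwise_conv1d(x, w):
    """Causal depthwise 1D convolution.

    x: (channels, length) input
    w: (channels, kernel) one filter per channel (depthwise);
       w[:, -1] weights the current sample.
    Left-padding by kernel-1 makes output[:, t] depend only
    on x[:, :t+1] (no lookahead).
    """
    C, L = x.shape
    _, K = w.shape
    xp = np.pad(x, ((0, 0), (K - 1, 0)))  # pad on the left only
    out = np.zeros((C, L), dtype=float)
    for k in range(K):
        # shifted slice: tap k sees the sample k-(K-1) steps back
        out += w[:, k:k+1] * xp[:, k:k+L]
    return out
```

With the filter `[0, 1]` per channel the output reproduces the input exactly, which is a quick way to check the causality convention.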
brendangregg/FlameGraph
Stack trace visualizer
KnowingNothing/MatmulTutorial
An easy-to-understand TensorOp matmul tutorial
LitLeo/OpenCUDA
NVIDIA/MatX
An efficient C++17 GPU numerical computing library with Python-like syntax
te42kyfo/gpu-benches
collection of benchmarks to measure basic GPU capabilities
arcee-ai/PruneMe
Automated Identification of Redundant Layer Blocks for Pruning in Large Language Models
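Similarity-based layer pruning rests on a simple measurement: compare the hidden state entering a block of consecutive layers with the one leaving it; the block that changes the representation least is the best candidate to drop. A hedged sketch of that scoring step (function and variable names are illustrative, not PruneMe's API):

```python
import numpy as np

def most_redundant_block(hidden_states, n_block):
    """Find the most redundant block of n_block consecutive layers.

    hidden_states: list of (d,) activation vectors, one per layer
    boundary (the model input, then each layer's output).
    Scores each candidate block by the cosine similarity between
    its input and output states; the highest-similarity block
    altered the representation least and is the pruning candidate.
    """
    best_start, best_sim = None, -np.inf
    for i in range(len(hidden_states) - n_block):
        a, b = hidden_states[i], hidden_states[i + n_block]
        sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        if sim > best_sim:
            best_start, best_sim = i, sim
    return best_start, best_sim
```

In practice the similarity is averaged over many calibration examples and token positions before choosing the block to remove.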
SakanaAI/evolutionary-model-merge
Official repository of Evolutionary Optimization of Model Merging Recipes
zinccat/Awesome-Triton-Kernels
Collection of kernels written in Triton language
linkedin/Liger-Kernel
Efficient Triton Kernels for LLM Training
CisMine/Guide-NVIDIA-Tools
NVIDIA tools guide
jaredhoberock/stanford-cs193g-sp2010
This is an archive of materials produced for an introductory class on CUDA programming at Stanford University in 2010
pentium3/sys_reading
system paper reading notes
Kobzol/hardware-effects-gpu
Demonstration of various hardware effects on CUDA GPUs.