ziyuhuang123's Stars
d2l-ai/d2l-zh
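Dive into Deep Learning: written for Chinese readers, runnable, and open to discussion. The Chinese and English editions are used for teaching at more than 500 universities in over 70 countries.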
meta-llama/llama
Inference code for Llama models
vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
apache/tvm
Open deep learning compiler stack for cpu, gpu and specialized accelerators
NVIDIA/cutlass
CUDA Templates for Linear Algebra Subroutines
HazyResearch/ThunderKittens
Tile primitives for speedy kernels
gpgpu-sim/gpgpu-sim_distribution
GPGPU-Sim provides a detailed simulation model of contemporary NVIDIA GPUs running CUDA and/or OpenCL workloads. It includes support for features such as TensorCores and CUDA Dynamic Parallelism as well as a performance visualization tool, AerialVision, and an integrated energy model, GPUWattch.
IST-DASLab/marlin
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
tspeterkim/flash-attention-minimal
Flash Attention in ~100 lines of CUDA (forward pass only)
forhaoliu/ringattention
Transformers with Arbitrarily Large Context
google-research/maxvit
[ECCV 2022] Official repository for "MaxViT: Multi-Axis Vision Transformer". SOTA foundation models for classification, detection, segmentation, image quality, and generative modeling...
cli99/llm-analysis
Latency and Memory Analysis of Transformer Models for Training and Inference
amd/xdna-driver
KnowingNothing/MatmulTutorial
An Easy-to-understand TensorOp Matmul Tutorial
te42kyfo/gpu-benches
collection of benchmarks to measure basic GPU capabilities
TiledTensor/TiledCUDA
TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles.
doongz/aics
AI Computing Systems, Chen Yunji
Accelergy-Project/timeloop-accelergy-exercises
Exercises for exploring the Fibertree, Timeloop and Accelergy tools
crosetto/cupq
a CUDA implementation of a priority queue
curtisseizert/CUDASieve
A GPU accelerated implementation of the sieve of Eratosthenes
ColfaxResearch/cfx-article-src
pku-liang/TileFlow
TileFlow is a performance analysis tool based on Timeloop for fusion dataflows
microsoft/cusync
MatanHamilis/one_stencil
Multiple 1-stencil implementations using NVIDIA CUDA.
KnowingNothing/Domino
gty111/GEMM_WMMA
GEMM by WMMA (tensor core)
parasailteam/cusync
nuno-azevedo/floyd-warshall-mpi
Parallel Computing - Floyd-Warshall MPI
galeselee/microbenchmark
Some microbenchmark practices
maffinnn/consistent_hash