ziyuhuang123

ziyuhuang123's Stars

d2l-ai/d2l-zh
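Dive into Deep Learning (《动手学深度学习》): written for Chinese readers, runnable, and open to discussion. The Chinese and English editions are used for teaching at more than 500 universities in over 70 countries.
Language: Python 64.4k 1.1k 011.1k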
meta-llama/llama
Inference code for Llama models
Language: Python 56.8k 526 1k9.6k
vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
Language: Python 32k 257 5.6k4.9k
apache/tvm
Open deep learning compiler stack for cpu, gpu and specialized accelerators
Language: Python 11.9k 376 3.4k3.5k
NVIDIA/cutlass
CUDA Templates for Linear Algebra Subroutines
Language: C++ 5.8k 109 1.2k1k
HazyResearch/ThunderKittens
Tile primitives for speedy kernels
Language: Cuda 1.7k 30 3179
gpgpu-sim/gpgpu-sim_distribution
GPGPU-Sim provides a detailed simulation model of contemporary NVIDIA GPUs running CUDA and/or OpenCL workloads. It includes support for features such as TensorCores and CUDA Dynamic Parallelism as well as a performance visualization tool, AerialVision, and an integrated energy model, GPUWattch.
Language: C++ 1.2k 46 171514
IST-DASLab/marlin
FP16xINT4 LLM inference kernel that can achieve near-ideal ~4x speedups up to medium batch sizes of 16-32 tokens.
Language: Python 655 15 2952
tspeterkim/flash-attention-minimal
Flash Attention in ~100 lines of CUDA (forward pass only)
Language: Cuda 651 4 658
forhaoliu/ringattention
Transformers with Arbitrarily Large Context
Language: Python 625 6 1648
google-research/maxvit
[ECCV 2022] Official repository for "MaxViT: Multi-Axis Vision Transformer". SOTA foundation models for classification, detection, segmentation, image quality, and generative modeling...
Language: Jupyter Notebook 451 9 2031
cli99/llm-analysis
Latency and Memory Analysis of Transformer Models for Training and Inference
Language: Python 358 9 1042
amd/xdna-driver
Language: C 331 26 3943
KnowingNothing/MatmulTutorial
An Easy-to-understand TensorOp Matmul Tutorial
Language: C++ 298 8 1131
te42kyfo/gpu-benches
collection of benchmarks to measure basic GPU capabilities
Language: Jupyter Notebook 264 9 1141
TiledTensor/TiledCUDA
TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles.
Language: C++ 167 3 6410
doongz/aics
AI Computing Systems (智能计算系统), Chen Yunji (陈云霁)
Language: C++ 136 2 016
Accelergy-Project/timeloop-accelergy-exercises
Exercises for exploring the Fibertree, Timeloop and Accelergy tools
Language: Jupyter Notebook 86 9 5429
crosetto/cupq
a CUDA implementation of a priority queue
Language: C++ 83 3 15
curtisseizert/CUDASieve
A GPU accelerated implementation of the sieve of Eratosthenes
Language: Cuda 62 8 716
ColfaxResearch/cfx-article-src
Language: C++ 55 5 314
pku-liang/TileFlow
TileFlow is a performance analysis tool based on Timeloop for fusion dataflows
Language: C++ 55 1 06
microsoft/cusync
Language: C++ 20 5 23
MatanHamilis/one_stencil
Multiple 1-stencil implementations using nvidia cuda.
Language: Cuda 13 1 04
KnowingNothing/Domino
Language: Python 9 1 00
gty111/GEMM_WMMA
GEMM by WMMA (tensor core)
Language: Cuda 8 1 07
parasailteam/cusync
Language: C++ 5 2 72
nuno-azevedo/floyd-warshall-mpi
Parallel Computing - Floyd-Warshall MPI
Language: TeX 3 1 05
galeselee/microbenchmark
Some microbenchmark practices
Language: Cuda 1
maffinnn/consistent_hash
Language: Go 10