Kipsora's Stars
ocornut/imgui
Dear ImGui: Bloat-free Graphical User Interface for C++ with minimal dependencies
hpcaitech/Open-Sora
Open-Sora: Democratizing Efficient Video Production for All
sgl-project/sglang
SGLang is a fast serving framework for large language models and vision language models.
adam-maj/tiny-gpu
A minimal GPU design in Verilog to learn how GPUs work from the ground up
NVIDIA/cutlass
CUDA Templates for Linear Algebra Subroutines
hojonathanho/diffusion
Denoising Diffusion Probabilistic Models
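For context, the DDPM forward (noising) process this repo implements has a well-known closed form: x_t = sqrt(ᾱ_t)·x₀ + sqrt(1−ᾱ_t)·ε. A minimal numpy sketch (not the repo's code; the function name and schedule handling are illustrative assumptions):

```python
import numpy as np

def ddpm_forward(x0, t, betas, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form (illustrative sketch):
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps.
    """
    alphas = 1.0 - betas                 # per-step retention factors
    alpha_bar = np.cumprod(alphas)[t]    # cumulative product up to step t
    eps = rng.standard_normal(x0.shape)  # the injected Gaussian noise
    xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps
    return xt, eps
```

With all betas at zero, ᾱ_t = 1 and x_t reduces to x₀, which makes the closed form easy to sanity-check.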
HazyResearch/ThunderKittens
Tile primitives for speedy kernels
gpu-mode/resource-stream
GPU programming related news and material links
NVIDIA/gdrcopy
A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology
bytedance/flux
A fast communication-overlapping library for tensor/expert parallelism on GPUs.
NVIDIA/multi-gpu-programming-models
Examples demonstrating available options to program multiple GPUs in a single node or a cluster
tpoisonooo/how-to-optimize-gemm
row-major matmul optimization
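The core idea such GEMM tutorials start from is cache blocking: iterate over square tiles so each tile of A and B is reused while it is hot in cache. A minimal sketch in numpy (not the repo's code; the block size is an arbitrary assumption):

```python
import numpy as np

def blocked_matmul(a, b, block=32):
    """Row-major matmul with simple cache blocking (tiling).

    Each (i0, j0) tile of C accumulates contributions from matching
    tiles of A and B, improving data reuse over the naive triple loop.
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    c = np.zeros((m, n), dtype=a.dtype)
    for i0 in range(0, m, block):
        for j0 in range(0, n, block):
            for p0 in range(0, k, block):
                c[i0:i0+block, j0:j0+block] += (
                    a[i0:i0+block, p0:p0+block] @ b[p0:p0+block, j0:j0+block]
                )
    return c
```

The CUDA versions in the tutorials apply the same decomposition, with tiles staged through shared memory and registers.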
feifeibear/long-context-attention
USP: Unified (a.k.a. Hybrid, 2D) Sequence Parallel Attention for Long Context Transformers Model Training and Inference
accel-sim/accel-sim-framework
This is the top-level repository for the Accel-Sim framework.
KnowingNothing/MatmulTutorial
An easy-to-understand TensorOp matmul tutorial
microsoft/mscclpp
MSCCL++: A GPU-driven communication stack for scalable AI applications
HazyResearch/flash-fft-conv
FlashFFTConv: Efficient Convolutions for Long Sequences with Tensor Cores
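The technique behind FFT-based convolution is the convolution theorem: pad to full output length, multiply spectra, and transform back, turning an O(n²) direct convolution into O(n log n). A minimal sketch (not the repo's kernel, which fuses this onto tensor cores):

```python
import numpy as np

def fft_conv(x, h):
    """Linear convolution via FFT (convolution theorem).

    Zero-pads both signals to the full output length n so the
    circular convolution computed by the FFT equals the linear one.
    """
    n = len(x) + len(h) - 1
    return np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(h, n), n)
```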
66RING/tiny-flash-attention
Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS
appl-team/appl
🍎APPL: A Prompt Programming Language. Seamlessly integrate LLMs with programs.
shawntan/scattermoe
Triton-based implementation of Sparse Mixture of Experts.
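The routing step in a sparse Mixture-of-Experts layer picks the top-k experts per token and renormalizes their gate scores; the scatter/gather of tokens to experts is what the Triton kernels accelerate. A minimal numpy sketch of the routing alone (not the repo's API; names and k are assumptions):

```python
import numpy as np

def topk_route(logits, k=2):
    """Top-k expert routing for a sparse MoE layer.

    For each token, select the k highest-scoring experts and
    softmax-normalize the gate weights over just those k.
    """
    idx = np.argpartition(logits, -k, axis=-1)[:, -k:]   # (tokens, k) expert ids
    picked = np.take_along_axis(logits, idx, axis=-1)    # their raw scores
    w = np.exp(picked - picked.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                   # gate weights sum to 1
    return idx, w
```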
DefTruth/Awesome-Diffusion-Inference
📖A curated list of Awesome Diffusion Inference Papers with codes: Sampling, Caching, Multi-GPUs, etc. 🎉🎉
ColfaxResearch/cutlass-kernels
TiledTensor/TiledCUDA
TiledCUDA is a highly efficient kernel template library designed to raise CUDA C's level of abstraction for processing tiles. Development has moved to the new repository at https://github.com/microsoft/TileFusion.
njuhope/cuda_sgemm
ColfaxResearch/cfx-article-src
tgale96/grouped_gemm
PyTorch bindings for CUTLASS grouped GEMM.
andylolu2/simpleGEMM
The simplest yet fast implementation of matrix multiplication in CUDA.
mcrl/tccl
Thunder Research Group's Collective Communication Library
DefTruth/Awesome-LLM-Inference
📖A curated list of Awesome LLM/VLM Inference Papers with codes: WINT8/4, FlashAttention, PagedAttention, MLA, Parallelism, etc. 🎉🎉
DefTruth/CUDA-Learn-Notes
📚200+ Tensor/CUDA Cores kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (reaching 98%~100% of cuBLAS/FA2 TFLOPS 🎉🎉).