cuda-kernels
There are 269 repositories under the cuda-kernels topic.
xlite-dev/LeetCUDA
📚LeetCUDA: Modern CUDA Learn Notes with PyTorch for Beginners🐑, 200+ CUDA Kernels, Tensor Cores, HGEMM, FA-2 MMA.🎉
NVIDIA/cuda-samples
Samples for CUDA developers that demonstrate features in the CUDA Toolkit
InternLM/lmdeploy
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
Rust-GPU/rust-cuda
Ecosystem of libraries and tools for writing and executing fast GPU code fully in Rust.
NVIDIA/cccl
CUDA Core Compute Libraries
coreylowman/dfdx
Deep learning in Rust, with shape checked tensors and neural networks
coreylowman/cudarc
Safe Rust wrapper around the CUDA toolkit
NVIDIA/nvbench
CUDA Kernel Benchmarking Library
KernelTuner/kernel_tuner
Kernel Tuner
harrism/hemi
Simple utilities to enable code reuse and portability between CUDA C/C++ and standard C/C++.
jaredhoberock/stanford-cs193g-sp2010
This is an archive of materials produced for an introductory class on CUDA programming at Stanford University in 2010
HenryNdubuaku/cuda-tutorials
CUDA tutorials for maths and ML with examples, covering multi-GPU, fused attention, Winograd convolution, and reinforcement learning.
deepakkumar1984/Amplifier.NET
Amplifier allows .NET developers to easily run complex applications with intensive mathematical computation on Intel CPU/GPU, NVIDIA, AMD without writing any additional C kernel code. Write your function in .NET and Amplifier will take care of running it on your favorite hardware.
PatWie/cuda-design-patterns
Some CUDA design patterns and a bit of template magic for CUDA
alexzhang13/flashattention2-custom-mask
Triton implementation of FlashAttention2 that adds Custom Masks.
tudelft/cuSNN
Spiking Neural Networks in C++ with strong GPU acceleration through CUDA
m-a-n-i-f-e-s-t/power-attention
Attention Kernels for Symmetric Power Transformers
wangsiping97/FastGEMV
High-speed GEMV kernels, with up to 2.7x speedup compared to the PyTorch baseline.
eyalroz/cuda-kat
CUDA kernel author's tools
microsoft/Accera
Open source cross-platform compiler for compute-intensive loops used in AI algorithms, from Microsoft Research
microsoft/TileFusion
TileFusion is an experimental C++ macro kernel template library that elevates the abstraction level in CUDA C for tile processing.
sandyresearch/chipmunk
🎬 3.7× faster video generation E2E 🖼️ 1.6× faster image generation E2E ⚡ ColumnSparseAttn 9.3× vs FlashAttn‑3 💨 ColumnSparseGEMM 2.5× vs cuBLAS
yalue/cuda_scheduling_examiner_mirror
A tool for examining GPU scheduling behavior.
emptysoal/cuda-image-preprocess
Speed up image preprocessing with CUDA when handling images or running TensorRT inference
mikeroyal/CUDA-Guide
CUDA Guide
bgin/RF-EMT
Radio-Frequency Engineering Modeling Toolkit (RF-EMT)
STEllAR-GROUP/octotiger
Astrophysics program simulating the evolution of star systems based on the fast multipole method on adaptive Octrees
bgin/Radar_ElectroOptical_Simulation
(REOS) Radar and ElectroOptical Simulation Framework written in Fortran.
p-sto/ConjugateGradients
Implementation of the conjugate gradient method using C and NVIDIA CUDA
evlasblom/cuda-opencv-examples
Using custom CUDA kernels with OpenCV Mat objects.
HuangCongQing/cuda-learning
Getting started with learning CUDA programming
UVA-LavaLab/PIMeval-PIMbench
PIMeval simulator and PIMbench suite
Koushikphy/Intro-to-CUDA-Fortran
A complete beginner's introduction to programming with CUDA Fortran
aredden/torch-bnb-fp4
Faster PyTorch bitsandbytes 4-bit FP4 nn.Linear ops
conradsnicta/bandicoot-code
Bandicoot: C++ library for GPU linear algebra & scientific computing - https://coot.sourceforge.io
yoyoberenguer/PygameShader
2D Game texture special effects