minefantast's Stars
zjhellofss/KuiperLLama
A great project for campus recruiting (autumn/spring recruitment) and internships: build an LLM inference framework from scratch, with support for LLama2/3 and Qwen2.5.
DefTruth/CUDA-Learn-Notes
📚Modern CUDA Learn Notes with PyTorch: Tensor/CUDA Cores, 📖150+ CUDA Kernels with PyTorch bindings, 📖HGEMM/SGEMM (95%~99% cuBLAS performance), 📖100+ LLM/CUDA Blogs.
daemyung/metal-by-tutorials-2nd
Metal by Tutorials, by the raywenderlich Tutorial Team
Yinghan-Li/YHs_Sample
Yinghan's Code Sample
KhronosGroup/Vulkan-Samples
One stop solution for all Vulkan samples
Bruce-Lee-LY/decoding_attention
Decoding Attention is specially optimized for multi-head attention (MHA) using CUDA cores for the decoding stage of LLM inference.
Bruce-Lee-LY/flash_attention_inference
Performance of the C++ interfaces of Flash Attention and Flash Attention v2 in large language model (LLM) inference scenarios.
Bruce-Lee-LY/cuda_hgemm
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores via the WMMA API and MMA PTX instructions.
leimao/TensorRT-Custom-Plugin-Example
Quick and Self-Contained TensorRT Custom Plugin Implementation and Integration
leimao/CUDA-GEMM-Optimization
CUDA Matrix Multiplication Optimization
openai/openai-gemm
Open single- and half-precision GEMM implementations
KhronosGroup/glslang
Khronos-reference front end for GLSL/ESSL, partial front end for HLSL, and a SPIR-V generator.
Keenuts/vulkan-compute
related to virglrender-vulkan: basic compute test application
vblanco20-1/vulkan-guide
Introductory guide to Vulkan.
KomputeProject/kompute
General-purpose GPU compute framework built on Vulkan to support thousands of cross-vendor graphics cards (AMD, Qualcomm, NVIDIA & friends). Blazing fast, mobile-enabled, asynchronous, and optimized for advanced GPU data-processing use cases. Backed by the Linux Foundation.
NVIDIA/TransformerEngine
A library for accelerating Transformer models on NVIDIA GPUs, including using 8-bit floating point (FP8) precision on Hopper and Ada GPUs, to provide better performance with lower memory utilization in both training and inference.
Sunt-ing/stick
:innocent: A PyTorch-like deep learning framework. Just for fun.
feifeibear/LLMSpeculativeSampling
Fast inference from large language models via speculative decoding
tpoisonooo/how-to-optimize-gemm
row-major matmul optimization
NervanaSystems/maxas
Assembler for NVIDIA Maxwell architecture
cloudcores/CuAssembler
An unofficial CUDA assembler, for all generations of SASS, hopefully :)
minitorch/minitorch
The full minitorch student suite.
GetUpEarlier/minit
karpathy/llm.c
LLM training in simple, raw C/CUDA
Jittor/jittor
Jittor is a high-performance deep learning framework based on JIT compiling and meta-operators.
mlc-ai/notebooks
mlc-ai/mlc-zh
hyperai/tvm-cn
TVM Documentation in Simplified Chinese
alibaba/MNN
MNN is a blazing fast, lightweight deep learning framework, battle-tested by business-critical use cases in Alibaba
ggerganov/llama.cpp
LLM inference in C/C++