Jameskry's Stars
reed-lau/cute-gemm
ifromeast/cuda_learning
learning how CUDA works
njuhope/cuda_sgemm
kebijuelun/Awesome-LLM-Learning
Learning Large Language Models (LLMs)
shouxieai/word_2_vec
word_2_vec
USCT-YQJ/custom_prpool_plugin
QINZHAOYU/CudaSteps
A CUDA learning path based on the book *CUDA Programming: Basics and Practice* by Zheyong Fan.
ekondis/mixbench
A GPU benchmark tool for evaluating GPUs and CPUs on mixed operational intensity kernels (CUDA, OpenCL, HIP, SYCL, OpenMP)
zhangkai0425/SGEMM-HPC
Implementation and optimization of matrix multiplication on a single CPU (HPC-THU-2023-Autumn)
HorizonRDK/hobot_codec
DefTruth/CUDA-Learn-Notes
🎉CUDA notes / hand-written CUDA kernels for LLMs / C++ notes, updated occasionally: flash_attn, sgemm, sgemv, warp reduce, block reduce, dot product, elementwise, softmax, layernorm, rmsnorm, hist, etc.
XiaoSong9905/CUDA-Optimization-Guide
Xiao's CUDA Optimization Guide [Actively Adding New Content]
ApolloAuto/apollo
An open autonomous driving platform
sesmfs/onnx_quant_tool
An ONNX-based quantization tool.
vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
qist/tvbox
Configuration files for the FongMi video app and tvbox; if you like them, please fork them for your own use. Read the repository notes carefully before use; using them means you are assumed to have understood them.
nicolaswilde/cuda-tensorcore-hgemm
Liu-xiandong/How_to_optimize_in_GPU
A series of GPU optimization topics that introduces in detail how to optimize CUDA kernels, covering several basic kernel optimizations: elementwise, reduce, sgemv, sgemm, etc. The performance of these kernels is at or near the theoretical limit.
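For context, the elementwise kernel is the simplest of the kernels listed in that entry. The sketch below is my own minimal example (not code from the repository) of the grid-stride-loop pattern such tutorials typically start from before applying vectorized loads and other optimizations; the kernel and variable names are illustrative only.

```cuda
#include <cuda_runtime.h>

// Minimal elementwise add: each thread processes elements i, i + stride, ...
// so any grid size covers the whole array.
__global__ void elementwise_add(const float* a, const float* b, float* c, int n) {
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    // Managed memory keeps the sketch short; real benchmarks usually use
    // explicit cudaMalloc/cudaMemcpy.
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    elementwise_add<<<256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```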
462630221/SampleCode
openppl-public/ppl.nn
A primitive library for neural networks
graykode/nlp-tutorial
Natural Language Processing Tutorial for Deep Learning Researchers
hellogcc/100-gdb-tips
A collection of GDB tips; "100" may just mean "many" here.
Bruce-Lee-LY/cuda_hgemv
Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.
foreverrookie/cuda-opt-samples
CUDA optimization samples including sgemm, reduce... To be continued.
jundaf2/CUDA-INT8-GEMM
CUDA 8-bit Tensor Core matrix multiplication based on the m16n16k16 WMMA API
triton-inference-server/server
The Triton Inference Server provides an optimized cloud and edge inferencing solution.
tpoisonooo/how-to-optimize-gemm
row-major matmul optimization
sesmfs/onnx_matcher
A pattern matcher for ONNX models, used to match and replace subgraphs.
cyrusbehr/YOLOv8-TensorRT-CPP
YOLOv8 TensorRT C++ Implementation
Bruce-Lee-LY/cuda_hgemm
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores, via the WMMA API and MMA PTX instructions.