tangpanyu's Stars
ggerganov/llama.cpp
LLM inference in C/C++
RVC-Boss/GPT-SoVITS
1 minute of voice data is enough to train a good TTS model! (few-shot voice cloning)
Dao-AILab/flash-attention
Fast and memory-efficient exact attention
triton-lang/triton
Development repository for the Triton language and compiler
facebookresearch/sam2
The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use the model.
davisking/dlib
A toolkit for making real world machine learning and data analysis applications in C++
rui314/chibicc
A small C compiler
DA-southampton/NLP_ability
A summary of the knowledge an NLP engineer needs to accumulate, including interview questions, fundamentals of all kinds, and engineering skills, to strengthen your core competitiveness.
NVIDIA/cuda-samples
Samples for CUDA developers demonstrating features of the CUDA Toolkit
DefTruth/CUDA-Learn-Notes
📚 200+ Tensor/CUDA Core kernels, ⚡️ flash-attn-mma, ⚡️ HGEMM with WMMA, MMA, and CuTe (98%–100% of cuBLAS/FA2 TFLOPS 🎉🎉).
flashinfer-ai/flashinfer
FlashInfer: Kernel Library for LLM Serving
CodingHanYa/workspace
workspace is a lightweight C++11 asynchronous execution framework supporting concurrent execution of generic tasks, priority-based task scheduling, an adaptive dynamic thread pool, an efficient static thread pool, an exception-handling mechanism, and more.
Liu-xiandong/How_to_optimize_in_GPU
This is a series of GPU optimization topics. Here we will introduce how to optimize the CUDA kernel in detail. I will introduce several basic kernel optimizations, including: elementwise, reduce, sgemv, sgemm, etc. The performance of these kernels is basically at or near the theoretical limit.
kaitoukito/Computer-Science-Textbooks
Collect some CS textbooks for learning.
mst272/LLM-Dojo
Welcome to LLM-Dojo, an open-source place to learn large language models, built with concise, readable code: a model-training framework (supporting mainstream models such as Qwen, Llama, GLM, etc.), an RLHF framework (DPO/CPO/KTO/PPO), and more. 👩🎓👨🎓
zeux/calm
CUDA/Metal accelerated language model inference
Bruce-Lee-LY/cuda_hgemm
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores via the WMMA API and MMA PTX instructions.
Cjkkkk/CUDA_gemm
A simple, high-performance CUDA GEMM implementation.
KnowingNothing/MatmulTutorial
An easy-to-understand TensorOp matmul tutorial
66RING/tiny-flash-attention
Flash attention tutorial written in Python, Triton, CUDA, and CUTLASS
gavinliu6/Makefile-Tutorial-zh-CN
Makefile tutorial (Chinese translation)
parallel101/stl1weekend
Build your own STL in one weekend
TiledTensor/TiledCUDA
We invite you to visit and follow our new repository at https://github.com/microsoft/TileFusion. TiledCUDA is a highly efficient kernel template library designed to elevate CUDA C’s level of abstraction for processing tiles.
AyakaGEMM/Hands-on-GEMM
reed-lau/cute-gemm
BooHwang/segment_anything_tensorrt
Accelerate Segment Anything Model inference using TensorRT 8.6.1.6
NVIDIA/online-softmax
Benchmark code for the "Online normalizer calculation for softmax" paper
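The paper's core trick can be sketched in a few lines of Python (a minimal illustration, not the repository's benchmark code): a single pass keeps a running maximum m and a running sum d of exponentials, rescaling d whenever m grows, so the normalizer is computed in one read of the input instead of the usual separate max and sum passes.

```python
import math

def online_softmax(xs):
    """Softmax whose normalizer is computed in a single online pass.

    Maintains a running maximum m and a running sum d of exp(x_i - m);
    when a new maximum appears, the accumulated sum is rescaled by
    exp(old_m - new_m) so it stays consistent with the new reference.
    """
    m = float("-inf")  # running maximum seen so far
    d = 0.0            # running sum of exp(x_i - m)
    for x in xs:
        m_new = max(m, x)
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / d for x in xs]
```

Subtracting the running maximum keeps every exponent non-positive, which is what makes the single-pass formulation numerically safe.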
ankan-ban/llama_cu_awq
Llama INT4 CUDA inference with AWQ
ankan-ban/llama2.cu
Inference Llama 2 in one file of pure CUDA
Tongkaio/MoE_inference
CUDA MoE kernels.