DefTruth
⚙️Now. @vipshop | Owner. @xlite-dev | 🤖Prev. @PaddlePaddle | 📚LeetCUDA | 🤗cache-dit | lite.ai.toolkit | 🤖ffpa-attn
@xlite-dev, @vipshop | Guangzhou, China
Pinned Repositories
Awesome-LLM-Inference
📖A curated list of Awesome LLM/VLM Inference Papers with code: WINT8/4, FlashAttention, PagedAttention, MLA, Parallelism, etc. 🎉🎉
BlogLearning
My personal learning journey, focusing on fun image processing algorithms, motion capture, and machine learning.
CUDA-Learn-Notes
📚200+ Tensor/CUDA Cores kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (reaching 98%~100% of cuBLAS/FA2 TFLOPS 🎉🎉).
lite.ai.toolkit
🛠 A lite C++ toolkit containing 100+ awesome AI models, supporting MNN, NCNN, TNN, ONNXRuntime and TensorRT. 🎉🎉
FastDeploy
High-performance Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle
cache-dit
A Unified Cache Acceleration Toolbox for 🤗Diffusers: FLUX.1, Qwen-Image-Edit, Qwen-Image, HunyuanImage-2.1, Wan 2.1/2.2, etc.
DefTruth's Repositories
DefTruth/CUDA-Learn-Notes
📚200+ Tensor/CUDA Cores kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (reaching 98%~100% of cuBLAS/FA2 TFLOPS 🎉🎉).
DefTruth/lite.ai.toolkit
🛠 A lite C++ toolkit containing 100+ awesome AI models, supporting MNN, NCNN, TNN, ONNXRuntime and TensorRT. 🎉🎉
DefTruth/Awesome-LLM-Inference
📖A curated list of Awesome LLM/VLM Inference Papers with code: WINT8/4, FlashAttention, PagedAttention, MLA, Parallelism, etc. 🎉🎉
DefTruth/hgemm-mma
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and the CuTe API, achieving peak⚡️ performance.
DefTruth/DefTruth-backup
DefTruth/triton
Development repository for the Triton language and compiler
DefTruth/ffpa-attn-mma
📚FFPA(Split-D): Yet another Faster Flash Prefill Attention with O(1) GPU SRAM complexity for headdim > 256, ~2x↑🎉vs SDPA EA.
DefTruth/TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
DefTruth/TensorRT-Model-Optimizer
TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization and sparsity. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.
DefTruth/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
DefTruth/cutlass
CUDA Templates for Linear Algebra Subroutines
DefTruth/flash-attention
Fast and memory-efficient exact attention
DefTruth/FlashMLA
FlashMLA: Efficient MLA Decoding Kernel for Hopper GPUs
DefTruth/InternVL
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal chat model approaching GPT-4o performance.
DefTruth/llm-action
This project shares the technical principles behind large language models, along with hands-on experience.
DefTruth/llm-compressor
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
DefTruth/MHA2MLA
Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs
DefTruth/MInference
[NeurIPS'24 Spotlight, ICLR'25] To speed up long-context LLM inference, MInference computes attention with approximate, dynamic sparsity, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
DefTruth/Awesome-Video-Attention
A curated list of recent papers on efficient video attention for video diffusion models, covering sparsification, quantization, caching, and more.
DefTruth/cache-dit
🤗CacheDiT: A Training-free and Easy-to-use Cache Acceleration Toolbox for Diffusion Transformers
DefTruth/chain-of-draft
Code and data for the Chain-of-Draft (CoD) paper
DefTruth/CogVideo
Text- and image-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023).
DefTruth/cuda_hgemm
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores, via the WMMA API and MMA PTX instructions.
DefTruth/lmdeploy
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
DefTruth/ParaAttention
Context parallel attention that accelerates DiT model inference with dynamic caching
DefTruth/sglang
SGLang is a structured generation language designed for large language models (LLMs). It makes your interaction with models faster and more controllable.
DefTruth/SpargeAttn
SpargeAttention: training-free sparse attention that can accelerate inference for any model.
DefTruth/TensorRT
NVIDIA® TensorRT™, an SDK for high-performance deep learning inference, includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for inference applications.
DefTruth/unlock-deepseek
Interpretations, extensions, and reproductions of the DeepSeek series of works.
DefTruth/xDiT
xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism