DefTruth
⚙️Now. @vipshop | Owner. @xlite-dev | 🤖Prev. @PaddlePaddle | 📚LeetCUDA | 🤗cache-dit | lite.ai.toolkit | 🤖ffpa-attn
@xlite-dev, @vipshop | Guangzhou, China
Pinned Repositories
Awesome-LLM-Inference
📖A curated list of Awesome LLM/VLM Inference Papers with code: WINT8/4, FlashAttention, PagedAttention, MLA, Parallelism, etc. 🎉🎉
BlogLearning
My personal learning journey, focusing on fun image processing algorithms, motion capture, and machine learning.
CUDA-Learn-Notes
📚200+ Tensor/CUDA Cores kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (reaching 98%~100% of cuBLAS/FA2 TFLOPS 🎉🎉).
lite.ai.toolkit
🛠 A lite C++ toolkit containing 100+ awesome AI models, supporting MNN, NCNN, TNN, ONNXRuntime and TensorRT. 🎉🎉
FastDeploy
High-performance Inference and Deployment Toolkit for LLMs and VLMs based on PaddlePaddle
cache-dit
A Unified Cache Acceleration Toolbox for 🤗Diffusers: FLUX.1, Qwen-Image-Edit, Qwen-Image, HunyuanImage-2.1, Wan 2.1/2.2, etc.
DefTruth's Repositories
DefTruth/CUDA-Learn-Notes
📚200+ Tensor/CUDA Cores kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA and CuTe (reaching 98%~100% of cuBLAS/FA2 TFLOPS 🎉🎉).
DefTruth/lite.ai.toolkit
🛠 A lite C++ toolkit containing 100+ awesome AI models, supporting MNN, NCNN, TNN, ONNXRuntime and TensorRT. 🎉🎉
DefTruth/Awesome-LLM-Inference
📖A curated list of Awesome LLM/VLM Inference Papers with code: WINT8/4, FlashAttention, PagedAttention, MLA, Parallelism, etc. 🎉🎉
DefTruth/hgemm-mma
⚡️Write HGEMM from scratch using Tensor Cores with WMMA, MMA and the CuTe API, achieving peak⚡️ performance.
DefTruth/DefTruth-backup
DefTruth/triton
Development repository for the Triton language and compiler
DefTruth/ffpa-attn-mma
📚FFPA(Split-D): Yet another Faster Flash Prefill Attention with O(1) GPU SRAM complexity for headdim > 256, ~2x↑🎉vs SDPA EA.
DefTruth/TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
DefTruth/TensorRT-Model-Optimizer
TensorRT Model Optimizer is a unified library of state-of-the-art model optimization techniques such as quantization and sparsity. It compresses deep learning models for downstream deployment frameworks like TensorRT-LLM or TensorRT to optimize inference speed on NVIDIA GPUs.
DefTruth/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
DefTruth/cutlass
CUDA Templates for Linear Algebra Subroutines
DefTruth/flash-attention
Fast and memory-efficient exact attention
DefTruth/FlashMLA
FlashMLA: Efficient MLA Decoding Kernel for Hopper GPUs
DefTruth/InternVL
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal chat model approaching GPT-4o performance.
DefTruth/llm-action
This project shares the technical principles behind large language models, along with hands-on experience.
DefTruth/llm-compressor
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
DefTruth/MHA2MLA
Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs
DefTruth/MInference
[NeurIPS'24 Spotlight, ICLR'25] To speed up long-context LLM inference, MInference computes attention with approximate, dynamic sparsity, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
DefTruth/Awesome-Video-Attention
A curated list of recent papers on efficient video attention for video diffusion models, covering sparsification, quantization, caching, and more.
DefTruth/cache-dit
🤗CacheDiT: A Training-free and Easy-to-use Cache Acceleration Toolbox for Diffusion Transformers
DefTruth/chain-of-draft
Code and data for the Chain-of-Draft (CoD) paper
DefTruth/CogVideo
Text- and image-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023).
DefTruth/cuda_hgemm
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores, via the WMMA API and MMA PTX instructions.
DefTruth/lmdeploy
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
DefTruth/ParaAttention
Context parallel attention that accelerates DiT model inference with dynamic caching
DefTruth/sglang
SGLang is a structured generation language designed for large language models (LLMs). It makes your interaction with models faster and more controllable.
DefTruth/SpargeAttn
SpargeAttention: training-free sparse attention that can accelerate inference for any model.
DefTruth/TensorRT
NVIDIA® TensorRT™, an SDK for high-performance deep learning inference, includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for inference applications.
DefTruth/unlock-deepseek
Interpretations, extensions, and reproductions of the DeepSeek series of works.
DefTruth/xDiT
xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism