Pinned Repositories
Awesome-Diffusion-Inference
📖 A curated list of Awesome Diffusion Inference papers with code: Sampling, Caching, Multi-GPU, etc. 🎉🎉
Awesome-LLM-Inference
📖 A curated list of Awesome LLM/VLM Inference papers with code: WINT8/4, Flash-Attention, Paged-Attention, MLA, Parallelism, Prefix-Cache, Chunked-Prefill, etc. 🎉🎉
CUDA-Learn-Notes
📚 200+ Tensor/CUDA Core kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA, and CuTe (reaching 98%~100% of cuBLAS/FA2 TFLOPS 🎉🎉).
ffpa-attn-mma
📚 FFPA (Split-D): yet another faster Flash Prefill Attention with O(1) GPU SRAM complexity for headdim > 256; ~2x↑🎉 vs SDPA EA.
lite.ai.toolkit
🛠 A lite C++ toolkit of 100+ awesome AI models, supporting ONNXRuntime (ORT), MNN, NCNN, TNN, and TensorRT. 🎉🎉
RVM-Inference
🔥 Robust Video Matting C++ inference toolkit with ONNXRuntime, MNN, NCNN, and TNN, via lite.ai.toolkit.
statistic-learning-R-note
📒 Notes on Li Hang's 《统计学习方法》 (Statistical Learning Methods): from theory to implementation in R. A 200-page PDF with detailed hand-derived formula walkthroughs and R implementations. 🎉🎉
torchlm
💎 A high-level pipeline for face landmark detection. It supports training, evaluation, exporting, inference (Python/C++), and 100+ data augmentations, and installs easily via pip.
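As a pointer to how the torchlm pipeline above is used in practice, here is a minimal augmentation sketch modeled on the project's README; the exact transform names and arguments are from memory and should be checked against the current release.

```python
import numpy as np
import torchlm

# Landmarks-aware augmentation: each transform updates the landmark
# coordinates so they stay consistent with the transformed image.
transform = torchlm.LandmarksCompose([
    torchlm.LandmarksRandomScale(prob=0.5),
    torchlm.LandmarksRandomRotate(40, prob=0.5),
])

image = np.random.randint(0, 255, (256, 256, 3), dtype=np.uint8)  # dummy HWC image
landmarks = np.random.rand(98, 2).astype(np.float32) * 256        # (N, 2) keypoints
new_image, new_landmarks = transform(image, landmarks)
```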
FastDeploy
⚡️ An easy-to-use and fast deep learning model deployment toolkit for ☁️Cloud, 📱Mobile, and 📹Edge, covering 20+ mainstream image, video, text, and audio scenarios and 150+ SOTA models with end-to-end optimization and multi-platform, multi-framework support.
vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
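As a reference point for the serving engines listed here, vLLM's offline batch-inference API is only a few lines; this sketch follows the project's quickstart (the model id is just an example):

```python
from vllm import LLM, SamplingParams

# Continuous batching and PagedAttention are handled internally.
prompts = ["Hello, my name is", "The capital of France is"]
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

llm = LLM(model="facebook/opt-125m")  # any HF-compatible model id
for out in llm.generate(prompts, params):
    print(out.prompt, "->", out.outputs[0].text)
```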
DefTruth's Repositories
DefTruth/lite.ai.toolkit
🛠 A lite C++ toolkit of 100+ awesome AI models, supporting ONNXRuntime (ORT), MNN, NCNN, TNN, and TensorRT. 🎉🎉
DefTruth/Awesome-LLM-Inference
📖 A curated list of Awesome LLM/VLM Inference papers with code: WINT8/4, Flash-Attention, Paged-Attention, MLA, Parallelism, Prefix-Cache, Chunked-Prefill, etc. 🎉🎉
DefTruth/CUDA-Learn-Notes
📚 200+ Tensor/CUDA Core kernels, ⚡️flash-attn-mma, ⚡️hgemm with WMMA, MMA, and CuTe (reaching 98%~100% of cuBLAS/FA2 TFLOPS 🎉🎉).
DefTruth/statistic-learning-R-note
📒 Notes on Li Hang's 《统计学习方法》 (Statistical Learning Methods): from theory to implementation in R. A 200-page PDF with detailed hand-derived formula walkthroughs and R implementations. 🎉🎉
DefTruth/torchlm
💎 A high-level pipeline for face landmark detection. It supports training, evaluation, exporting, inference (Python/C++), and 100+ data augmentations, and installs easily via pip.
DefTruth/Awesome-Diffusion-Inference
📖 A curated list of Awesome Diffusion Inference papers with code: Sampling, Caching, Multi-GPU, etc. 🎉🎉
DefTruth/ffpa-attn-mma
📚 FFPA (Split-D): yet another faster Flash Prefill Attention with O(1) GPU SRAM complexity for headdim > 256; ~2x↑🎉 vs SDPA EA.
DefTruth/RVM-Inference
🔥 Robust Video Matting C++ inference toolkit with ONNXRuntime, MNN, NCNN, and TNN, via lite.ai.toolkit.
DefTruth/hgemm-mma
⚡️ Write HGEMM from scratch using Tensor Cores with the WMMA, MMA, and CuTe APIs, achieving peak⚡️ performance.
DefTruth/DefTruth
DefTruth/triton
Development repository for the Triton language and compiler
DefTruth/TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
DefTruth/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
DefTruth/cutlass
CUDA Templates for Linear Algebra Subroutines
DefTruth/flash-attention
Fast and memory-efficient exact attention
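For orientation, the core entry point of this library is `flash_attn_func`; a minimal sketch, assuming fp16 tensors on a CUDA device:

```python
import torch
from flash_attn import flash_attn_func

# q, k, v: (batch, seqlen, nheads, headdim), fp16/bf16, CUDA-resident.
q = torch.randn(2, 1024, 8, 64, dtype=torch.float16, device="cuda")
k = torch.randn(2, 1024, 8, 64, dtype=torch.float16, device="cuda")
v = torch.randn(2, 1024, 8, 64, dtype=torch.float16, device="cuda")

# Exact attention computed tile-by-tile in SRAM, never materializing
# the full (seqlen x seqlen) score matrix in HBM.
out = flash_attn_func(q, k, v, causal=True)  # (2, 1024, 8, 64)
```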
DefTruth/FlashMLA
FlashMLA: Efficient MLA Decoding Kernel for Hopper GPUs
DefTruth/InternVL
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal chat model with performance approaching GPT-4o's.
DefTruth/llm-action
This project shares the technical principles behind large language models along with hands-on engineering experience.
DefTruth/llm-compressor
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
DefTruth/MHA2MLA
Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs
DefTruth/chain-of-draft
Code and data for the Chain-of-Draft (CoD) paper
DefTruth/CogVideo
Text- and image-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023).
DefTruth/cuda_hgemm
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores, via the WMMA API and MMA PTX instructions.
DefTruth/flashinfer
FlashInfer: Kernel Library for LLM Serving
DefTruth/lmdeploy
LMDeploy is a toolkit for compressing, deploying, and serving LLM
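A minimal sketch of LMDeploy's high-level pipeline API, following the style of its README; the model id is an example and is fetched on first use:

```python
from lmdeploy import pipeline

# One object wraps engine setup, batching, and generation.
pipe = pipeline("internlm/internlm2_5-7b-chat")  # example model id
responses = pipe(["Hi, please introduce yourself."])
for r in responses:
    print(r.text)
```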
DefTruth/ParaAttention
Context-parallel attention that accelerates DiT model inference with dynamic caching.
DefTruth/SageAttention
Quantized attention that achieves 2.1-3.1x and 2.7-5.1x speedups over FlashAttention2 and xformers, respectively, without losing end-to-end accuracy across various models.
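The repo exposes quantized attention as a near drop-in replacement for FlashAttention-style calls; a minimal sketch, assuming the `sageattn` entry point and a (batch, heads, seqlen, headdim) layout:

```python
import torch
from sageattention import sageattn

# q, k, v: (batch, nheads, seqlen, headdim), fp16/bf16, CUDA-resident.
q = torch.randn(1, 8, 4096, 64, dtype=torch.float16, device="cuda")
k = torch.randn(1, 8, 4096, 64, dtype=torch.float16, device="cuda")
v = torch.randn(1, 8, 4096, 64, dtype=torch.float16, device="cuda")

# Quantized QK^T under the hood; intended to keep end-to-end accuracy
# while outperforming FlashAttention2 on supported GPUs.
out = sageattn(q, k, v, is_causal=False)
```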
DefTruth/sglang
SGLang is a structured generation language designed for large language models (LLMs). It makes your interaction with models faster and more controllable.
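To illustrate the "structured generation" idea in the description, here is a tiny SGLang frontend program in the style of the project's README; it assumes a launched SGLang runtime at the example endpoint:

```python
import sglang as sgl

# A structured program: prompt state `s` is built step by step,
# and sgl.gen marks the slot the model should fill in.
@sgl.function
def qa(s, question):
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=64))

# Point the frontend at a running SGLang backend (example endpoint).
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = qa.run(question="What is PagedAttention?")
print(state["answer"])
```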
DefTruth/unlock-deepseek
Interpretation, extension, and reproduction of the DeepSeek series of works.
DefTruth/xDiT
xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism