luliyucoordinate's Stars
microsoft/cusync
microsoft/BitNet
Official inference framework for 1-bit LLMs
dimforge/wgmath
GPU scientific computing on every platform
Shenyi-Z/ToCa
Accelerating Diffusion Transformers with Token-wise Feature Caching
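The caching idea is roughly: between adjacent diffusion steps most tokens' features barely change, so their block outputs can be reused. A minimal sketch of token-wise feature caching (an illustration of the general technique, not ToCa's actual code; `cached_block` and its threshold are hypothetical):

```python
import torch

def cached_block(block, x, cache, threshold=1e-2):
    """block: any module mapping (N, D) -> (N, D); cache holds the previous step's input/output."""
    if "x" not in cache:
        y = block(x)
    else:
        # Relative per-token change of the input since the previous step.
        delta = (x - cache["x"]).norm(dim=-1) / (cache["x"].norm(dim=-1) + 1e-6)
        stale = delta > threshold            # only these tokens are recomputed
        y = cache["y"].clone()
        if stale.any():
            y[stale] = block(x[stale])
    cache["x"], cache["y"] = x.detach(), y.detach()
    return y

block = torch.nn.Linear(64, 64)
cache = {}
x = torch.randn(16, 64)
y1 = cached_block(block, x, cache)                               # step 1: full compute
y2 = cached_block(block, x + 1e-4 * torch.randn_like(x), cache)  # step 2: mostly cached
```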
ChenMnZ/PrefixQuant
An algorithm for static activation quantization of LLMs
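For context, *static* activation quantization fixes the scales offline from calibration data instead of recomputing them per batch. A minimal sketch of that general technique (not PrefixQuant's actual algorithm; the helper names are hypothetical):

```python
import torch

def calibrate_scale(calib_acts, n_bits=8):
    """Pick one per-tensor scale from calibration activations, offline."""
    max_abs = max(a.abs().max() for a in calib_acts)
    return max_abs / (2 ** (n_bits - 1) - 1)

def quantize_static(x, scale, n_bits=8):
    qmax = 2 ** (n_bits - 1) - 1
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q.to(torch.int8)

calib = [torch.randn(4, 128) for _ in range(8)]
scale = calibrate_scale(calib)        # fixed once, before deployment
q = quantize_static(torch.randn(4, 128), scale)
x_hat = q.float() * scale             # dequantize for reference
```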
mit-han-lab/duo-attention
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
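The split named in the title implies a simple KV-cache policy: retrieval heads keep the full cache, while streaming heads keep only a few initial "sink" tokens plus a recent window. A minimal sketch under that reading (illustration only; `trim_kv` and its defaults are hypothetical):

```python
import torch

def trim_kv(k, v, is_streaming_head, n_sink=4, window=256):
    """k, v: (seq_len, head_dim) for one head."""
    if not is_streaming_head or k.shape[0] <= n_sink + window:
        return k, v                                   # retrieval head: keep everything
    keep = torch.cat([torch.arange(n_sink),
                      torch.arange(k.shape[0] - window, k.shape[0])])
    return k[keep], v[keep]

k, v = torch.randn(1000, 64), torch.randn(1000, 64)
k_s, v_s = trim_kv(k, v, is_streaming_head=True)      # 4 sink + 256 recent entries
k_r, v_r = trim_kv(k, v, is_streaming_head=False)     # full 1000 entries
```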
kyutai-labs/moshi
THU-DSP-LAB/ventus-gpgpu-verilog
GPGPU supporting the RISC-V vector (RVV) extension, developed in Verilog HDL
kevmo314/scuda
SCUDA is a GPU-over-IP bridge that lets GPUs on remote machines be attached to CPU-only machines.
baidu-research/baidu-allreduce
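This repo is best known as a reference implementation of ring all-reduce (in C++/MPI/CUDA). Below is a toy single-process simulation of the algorithm, assuming the standard two phases of P-1 steps each: scatter-reduce, then all-gather.

```python
import numpy as np

def ring_allreduce(data):
    """data[p]: worker p's vector; all the same length, split into P chunks."""
    P = len(data)
    chunks = [np.array_split(d.astype(float), P) for d in data]
    # Phase 1: scatter-reduce. At step s, worker p forwards chunk (p - s) mod P
    # to its right neighbor, which accumulates it. After P-1 steps, worker p
    # holds the fully reduced chunk (p + 1) mod P.
    for s in range(P - 1):
        sends = [(p, (p - s) % P, chunks[p][(p - s) % P].copy()) for p in range(P)]
        for p, c, msg in sends:
            chunks[(p + 1) % P][c] += msg
    # Phase 2: all-gather. Each worker forwards its completed chunk around the
    # ring, so after another P-1 steps every worker holds every reduced chunk.
    for s in range(P - 1):
        sends = [(p, (p + 1 - s) % P, chunks[p][(p + 1 - s) % P].copy()) for p in range(P)]
        for p, c, msg in sends:
            chunks[(p + 1) % P][c] = msg
    return [np.concatenate(c) for c in chunks]

data = [np.arange(8) * (p + 1) for p in range(4)]
out = ring_allreduce(data)
assert all(np.allclose(o, sum(data)) for o in out)
```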
thu-ml/SageAttention
Quantized attention that achieves 2.1-3.1x and 2.7-5.1x speedups over FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
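A minimal PyTorch sketch of the general idea behind quantized attention (not SageAttention's kernels): round Q and K to an INT8 grid with per-tensor scales, compute the score matmul on the integer values, and fold the scales back in before the softmax. The rounded values are kept in float tensors here because plain PyTorch matmul does not run on int8:

```python
import torch

def int8_attention(q, k, v):
    sq = q.abs().max() / 127
    sk = k.abs().max() / 127
    qi = torch.round(q / sq).clamp(-128, 127)   # integer-valued, stored as float
    ki = torch.round(k / sk).clamp(-128, 127)
    scores = (qi @ ki.transpose(-2, -1)) * (sq * sk) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v    # the P.V matmul stays in float here

q, k, v = (torch.randn(2, 8, 128, 64) for _ in range(3))
out = int8_attention(q, k, v)
ref = torch.softmax(q @ k.transpose(-2, -1) / 64 ** 0.5, -1) @ v
print((out - ref).abs().max())                  # small quantization error
```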
tpn/pdfs
Technically-oriented PDF Collection (Papers, Specs, Decks, Manuals, etc)
gpgpu-sim/gpgpu-sim_distribution
GPGPU-Sim provides a detailed simulation model of contemporary NVIDIA GPUs running CUDA and/or OpenCL workloads. It includes support for features such as Tensor Cores and CUDA Dynamic Parallelism, as well as a performance visualization tool, AerialVision, and an integrated energy model, GPUWattch.
THU-DSP-LAB/ventus-gpgpu
GPGPU processor supporting the RISC-V vector (RVV) extension, developed with Chisel HDL
thuml/depyf
depyf is a tool to help you understand and adapt to the PyTorch compiler, torch.compile.
GindaChen/FlexFlashAttention3
FlexAttention with FlashAttention3 support
microsoft/VPTQ
VPTQ: a flexible and extreme low-bit quantization algorithm
mobiusml/gemlite
Fast low-bit matmul kernels in Triton
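For reference, here is the logic such a kernel has to implement, written as plain PyTorch rather than Triton (gemlite fuses all of this into a single kernel; the packing layout below is one common choice, not necessarily gemlite's):

```python
import torch

def pack_int4(w_q):
    """w_q: (out, in) integers in [0, 15]; pack two 4-bit values per byte."""
    return (w_q[:, 0::2] | (w_q[:, 1::2] << 4)).to(torch.uint8)

def matmul_int4(x, packed, scale, zero):
    lo = (packed & 0xF).to(torch.float32)
    hi = (packed >> 4).to(torch.float32)
    w_q = torch.stack([lo, hi], dim=-1).flatten(-2)   # interleave back to (out, in)
    w = (w_q - zero) * scale                          # dequantize
    return x @ w.t()

w_q = torch.randint(0, 16, (32, 64))
packed = pack_int4(w_q)                               # half the bytes of int8 storage
y = matmul_int4(torch.randn(4, 64), packed, scale=0.05, zero=8.0)
```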
Tencent/TurboTransformers
A fast and user-friendly runtime for transformer inference (BERT, ALBERT, GPT-2, decoders, etc.) on CPU and GPU.
VivekPanyam/cudaparsers
Parsers for CUDA binary files
Cornell-RelaxML/qtip
chenzomi12/AISystem
AISystem mainly refers to AI systems: full-stack AI infrastructure including AI chips, AI compilers, and AI inference and training frameworks
Huage001/LinFusion
Official PyTorch and Diffusers Implementation of "LinFusion: 1 GPU, 1 Minute, 16K Image"
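LinFusion builds on linear attention; the generic trick is that with a positive feature map φ, (φ(Q)φ(K)ᵀ)V can be regrouped as φ(Q)(φ(K)ᵀV), so cost scales linearly in the token count N instead of quadratically. A sketch of generic linear attention (not LinFusion's exact formulation):

```python
import torch

def linear_attention(q, k, v, eps=1e-6):
    # elu(x) + 1 is a standard positive feature map for linear attention.
    phi_q = torch.nn.functional.elu(q) + 1
    phi_k = torch.nn.functional.elu(k) + 1
    kv = phi_k.transpose(-2, -1) @ v    # (d, d) summary, independent of N
    z = phi_q @ phi_k.sum(dim=-2, keepdim=True).transpose(-2, -1)  # normalizer
    return (phi_q @ kv) / (z + eps)

q, k, v = (torch.randn(2, 4096, 64) for _ in range(3))
out = linear_attention(q, k, v)         # no 4096 x 4096 score matrix is formed
```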
jbush001/NyuziProcessor
GPGPU microprocessor architecture
gojasper/flash-diffusion
⚡ Flash Diffusion ⚡: Accelerating Any Conditional Diffusion Model for Few Steps Image Generation (AAAI 2025)
shadowpa0327/Palu
Code for Palu: Compressing KV-Cache with Low-Rank Projection
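A minimal sketch of the idea named in the title (not Palu's actual method): fit a rank-r basis for the key space offline, cache the (n, r) projections instead of the (n, d) keys, and reconstruct at attention time:

```python
import torch

d, r, n = 128, 32, 1000
K = torch.randn(n, d)

# Offline: fit a rank-r basis (here via SVD of sample keys).
U = torch.linalg.svd(K, full_matrices=False).Vh[:r].T   # (d, r)

K_small = K @ U        # cached: (n, r) instead of (n, d) -> 4x memory saving
K_hat = K_small @ U.T  # reconstructed keys used at attention time
print(torch.norm(K - K_hat) / torch.norm(K))            # relative reconstruction error
```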
microsoft/mscclpp
MSCCL++: A GPU-driven communication stack for scalable AI applications
bytedance/ABQ-LLM
An acceleration library that supports quantized operations with arbitrary combinations of bit-widths
efeslab/Nanoflow
A throughput-oriented high-performance serving framework for LLMs
IST-DASLab/Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
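2:4 (semi-structured) sparsity keeps at most 2 nonzeros in every group of 4 consecutive weights, a pattern NVIDIA sparse tensor cores can skip over. A minimal sketch of pruning to that pattern (illustration only; Marlin-style kernels consume the packed sparse format, not this dense masked tensor):

```python
import torch

def prune_2_4(w):
    """w: (..., in_features) with in_features % 4 == 0; keep the top-2 magnitudes per group of 4."""
    groups = w.reshape(*w.shape[:-1], -1, 4)
    idx = groups.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(groups).scatter(-1, idx, 1.0).bool()
    return (groups * mask).reshape_as(w)

w = torch.randn(8, 16)
w_sparse = prune_2_4(w)   # exactly 2 nonzeros per group of 4
assert (w_sparse.reshape(-1, 4) != 0).sum(-1).max() <= 2
```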