luliyucoordinate's Stars
microsoft/cusync
microsoft/BitNet
Official inference framework for 1-bit LLMs
dimforge/wgmath
GPU scientific computing on every platform
Shenyi-Z/ToCa
Accelerating Diffusion Transformers with Token-wise Feature Caching
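The caching idea is roughly: between adjacent diffusion steps most tokens' features barely change, so their block outputs can be reused. A minimal sketch of token-wise feature caching (an illustration of the general technique, not ToCa's actual code; `cached_block` and its threshold are hypothetical):

```python
import torch

def cached_block(block, x, cache, threshold=1e-2):
    """block: any module mapping (N, D) -> (N, D); cache holds the previous step's input/output."""
    if "x" not in cache:
        y = block(x)
    else:
        # Relative per-token change of the input since the previous step.
        delta = (x - cache["x"]).norm(dim=-1) / (cache["x"].norm(dim=-1) + 1e-6)
        stale = delta > threshold            # only these tokens are recomputed
        y = cache["y"].clone()
        if stale.any():
            y[stale] = block(x[stale])
    cache["x"], cache["y"] = x.detach(), y.detach()
    return y

block = torch.nn.Linear(64, 64)
cache = {}
x = torch.randn(16, 64)
y1 = cached_block(block, x, cache)                               # step 1: full compute
y2 = cached_block(block, x + 1e-4 * torch.randn_like(x), cache)  # step 2: mostly cached
```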
ChenMnZ/PrefixQuant
An algorithm for static activation quantization of LLMs
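For context, *static* activation quantization fixes the scales offline from calibration data instead of recomputing them per batch. A minimal sketch of that general technique (not PrefixQuant's actual algorithm; the helper names are hypothetical):

```python
import torch

def calibrate_scale(calib_acts, n_bits=8):
    """Pick one per-tensor scale from calibration activations, offline."""
    max_abs = max(a.abs().max() for a in calib_acts)
    return max_abs / (2 ** (n_bits - 1) - 1)

def quantize_static(x, scale, n_bits=8):
    qmax = 2 ** (n_bits - 1) - 1
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return q.to(torch.int8)

calib = [torch.randn(4, 128) for _ in range(8)]
scale = calibrate_scale(calib)        # fixed once, before deployment
q = quantize_static(torch.randn(4, 128), scale)
x_hat = q.float() * scale             # dequantize for reference
```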
mit-han-lab/duo-attention
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
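The split named in the title implies a simple KV-cache policy: retrieval heads keep the full cache, while streaming heads keep only a few initial "sink" tokens plus a recent window. A minimal sketch under that reading (illustration only; `trim_kv` and its defaults are hypothetical):

```python
import torch

def trim_kv(k, v, is_streaming_head, n_sink=4, window=256):
    """k, v: (seq_len, head_dim) for one head."""
    if not is_streaming_head or k.shape[0] <= n_sink + window:
        return k, v                                   # retrieval head: keep everything
    keep = torch.cat([torch.arange(n_sink),
                      torch.arange(k.shape[0] - window, k.shape[0])])
    return k[keep], v[keep]

k, v = torch.randn(1000, 64), torch.randn(1000, 64)
k_s, v_s = trim_kv(k, v, is_streaming_head=True)      # 4 sink + 256 recent entries
k_r, v_r = trim_kv(k, v, is_streaming_head=False)     # full 1000 entries
```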
kyutai-labs/moshi
THU-DSP-LAB/ventus-gpgpu-verilog
GPGPU supporting the RISC-V vector (RVV) extension, developed in Verilog HDL
kevmo314/scuda
SCUDA is a GPU-over-IP bridge that lets GPUs on remote machines be attached to CPU-only machines.
baidu-research/baidu-allreduce
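This repo is best known as a reference implementation of ring all-reduce (in C++/MPI/CUDA). Below is a toy single-process simulation of the algorithm, assuming the standard two phases of P-1 steps each: scatter-reduce, then all-gather.

```python
import numpy as np

def ring_allreduce(data):
    """data[p]: worker p's vector; all the same length, split into P chunks."""
    P = len(data)
    chunks = [np.array_split(d.astype(float), P) for d in data]
    # Phase 1: scatter-reduce. At step s, worker p forwards chunk (p - s) mod P
    # to its right neighbor, which accumulates it. After P-1 steps, worker p
    # holds the fully reduced chunk (p + 1) mod P.
    for s in range(P - 1):
        sends = [(p, (p - s) % P, chunks[p][(p - s) % P].copy()) for p in range(P)]
        for p, c, msg in sends:
            chunks[(p + 1) % P][c] += msg
    # Phase 2: all-gather. Each worker forwards its completed chunk around the
    # ring, so after another P-1 steps every worker holds every reduced chunk.
    for s in range(P - 1):
        sends = [(p, (p + 1 - s) % P, chunks[p][(p + 1 - s) % P].copy()) for p in range(P)]
        for p, c, msg in sends:
            chunks[(p + 1) % P][c] = msg
    return [np.concatenate(c) for c in chunks]

data = [np.arange(8) * (p + 1) for p in range(4)]
out = ring_allreduce(data)
assert all(np.allclose(o, sum(data)) for o in out)
```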
thu-ml/SageAttention
Quantized attention that achieves 2.1-3.1x and 2.7-5.1x speedups over FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
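A minimal PyTorch sketch of the general idea behind quantized attention (not SageAttention's kernels): round Q and K to an INT8 grid with per-tensor scales, compute the score matmul on the integer values, and fold the scales back in before the softmax. The rounded values are kept in float tensors here because plain PyTorch matmul does not run on int8:

```python
import torch

def int8_attention(q, k, v):
    sq = q.abs().max() / 127
    sk = k.abs().max() / 127
    qi = torch.round(q / sq).clamp(-128, 127)   # integer-valued, stored as float
    ki = torch.round(k / sk).clamp(-128, 127)
    scores = (qi @ ki.transpose(-2, -1)) * (sq * sk) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v    # the P.V matmul stays in float here

q, k, v = (torch.randn(2, 8, 128, 64) for _ in range(3))
out = int8_attention(q, k, v)
ref = torch.softmax(q @ k.transpose(-2, -1) / 64 ** 0.5, -1) @ v
print((out - ref).abs().max())                  # small quantization error
```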
tpn/pdfs
Technically-oriented PDF Collection (Papers, Specs, Decks, Manuals, etc)
gpgpu-sim/gpgpu-sim_distribution
GPGPU-Sim provides a detailed simulation model of contemporary NVIDIA GPUs running CUDA and/or OpenCL workloads. It includes support for features such as Tensor Cores and CUDA Dynamic Parallelism, as well as a performance visualization tool, AerialVision, and an integrated energy model, GPUWattch.
THU-DSP-LAB/ventus-gpgpu
GPGPU processor supporting the RISC-V vector (RVV) extension, developed with Chisel HDL
thuml/depyf
depyf is a tool to help you understand and adapt to the PyTorch compiler, torch.compile.
GindaChen/FlexFlashAttention3
FlexAttention with FlashAttention3 support
microsoft/VPTQ
VPTQ: a flexible and extreme low-bit quantization algorithm
mobiusml/gemlite
Fast low-bit matmul kernels in Triton
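For reference, here is the logic such a kernel has to implement, written as plain PyTorch rather than Triton (gemlite fuses all of this into a single kernel; the packing layout below is one common choice, not necessarily gemlite's):

```python
import torch

def pack_int4(w_q):
    """w_q: (out, in) integers in [0, 15]; pack two 4-bit values per byte."""
    return (w_q[:, 0::2] | (w_q[:, 1::2] << 4)).to(torch.uint8)

def matmul_int4(x, packed, scale, zero):
    lo = (packed & 0xF).to(torch.float32)
    hi = (packed >> 4).to(torch.float32)
    w_q = torch.stack([lo, hi], dim=-1).flatten(-2)   # interleave back to (out, in)
    w = (w_q - zero) * scale                          # dequantize
    return x @ w.t()

w_q = torch.randint(0, 16, (32, 64))
packed = pack_int4(w_q)                               # half the bytes of int8 storage
y = matmul_int4(torch.randn(4, 64), packed, scale=0.05, zero=8.0)
```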
Tencent/TurboTransformers
A fast and user-friendly runtime for transformer inference (BERT, ALBERT, GPT-2, decoders, etc.) on CPU and GPU.
VivekPanyam/cudaparsers
Parsers for CUDA binary files
Cornell-RelaxML/qtip
chenzomi12/AISystem
AISystem mainly refers to AI systems: full-stack AI infrastructure including AI chips, AI compilers, and AI inference and training frameworks
Huage001/LinFusion
Official PyTorch and Diffusers Implementation of "LinFusion: 1 GPU, 1 Minute, 16K Image"
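LinFusion builds on linear attention; the generic trick is that with a positive feature map φ, (φ(Q)φ(K)ᵀ)V can be regrouped as φ(Q)(φ(K)ᵀV), so cost scales linearly in the token count N instead of quadratically. A sketch of generic linear attention (not LinFusion's exact formulation):

```python
import torch

def linear_attention(q, k, v, eps=1e-6):
    # elu(x) + 1 is a standard positive feature map for linear attention.
    phi_q = torch.nn.functional.elu(q) + 1
    phi_k = torch.nn.functional.elu(k) + 1
    kv = phi_k.transpose(-2, -1) @ v    # (d, d) summary, independent of N
    z = phi_q @ phi_k.sum(dim=-2, keepdim=True).transpose(-2, -1)  # normalizer
    return (phi_q @ kv) / (z + eps)

q, k, v = (torch.randn(2, 4096, 64) for _ in range(3))
out = linear_attention(q, k, v)         # no 4096 x 4096 score matrix is formed
```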
jbush001/NyuziProcessor
GPGPU microprocessor architecture
gojasper/flash-diffusion
⚡ Flash Diffusion ⚡: Accelerating Any Conditional Diffusion Model for Few Steps Image Generation (AAAI 2025)
shadowpa0327/Palu
Code for Palu: Compressing KV-Cache with Low-Rank Projection
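A minimal sketch of the idea named in the title (not Palu's actual method): fit a rank-r basis for the key space offline, cache the (n, r) projections instead of the (n, d) keys, and reconstruct at attention time:

```python
import torch

d, r, n = 128, 32, 1000
K = torch.randn(n, d)

# Offline: fit a rank-r basis (here via SVD of sample keys).
U = torch.linalg.svd(K, full_matrices=False).Vh[:r].T   # (d, r)

K_small = K @ U        # cached: (n, r) instead of (n, d) -> 4x memory saving
K_hat = K_small @ U.T  # reconstructed keys used at attention time
print(torch.norm(K - K_hat) / torch.norm(K))            # relative reconstruction error
```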
microsoft/mscclpp
MSCCL++: A GPU-driven communication stack for scalable AI applications
bytedance/ABQ-LLM
An acceleration library that supports quantized operations with arbitrary combinations of bit-widths
efeslab/Nanoflow
A throughput-oriented high-performance serving framework for LLMs
IST-DASLab/Sparse-Marlin
Boosting 4-bit inference kernels with 2:4 Sparsity
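2:4 (semi-structured) sparsity keeps at most 2 nonzeros in every group of 4 consecutive weights, a pattern NVIDIA sparse tensor cores can skip over. A minimal sketch of pruning to that pattern (illustration only; Marlin-style kernels consume the packed sparse format, not this dense masked tensor):

```python
import torch

def prune_2_4(w):
    """w: (..., in_features) with in_features % 4 == 0; keep the top-2 magnitudes per group of 4."""
    groups = w.reshape(*w.shape[:-1], -1, 4)
    idx = groups.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(groups).scatter(-1, idx, 1.0).bool()
    return (groups * mask).reshape_as(w)

w = torch.randn(8, 16)
w_sparse = prune_2_4(w)   # exactly 2 nonzeros per group of 4
assert (w_sparse.reshape(-1, 4) != 0).sum(-1).max() <= 2
```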