MeeCreeps's Stars
RWKV/rwkv.cpp
INT4/INT5/INT8 and FP16 inference on CPU for RWKV language model
AmberLJC/LLMSys-PaperList
Large Language Model (LLM) Systems Paper List
microsoft/ArchProbe
A profiler to disclose and quantify hardware features on GPUs.
google/uVkCompute
A micro Vulkan compute pipeline and a collection of benchmarking compute shaders
microsoft/T-MAC
Low-bit LLM inference on CPU with lookup table
pytorch/torchchat
Run PyTorch LLMs locally on servers, desktop and mobile
tpoisonooo/how-to-optimize-gemm
row-major matmul optimization
flame/how-to-optimize-gemm
google/spirv-tutor
SaschaWillems/VulkanCapsViewer
Vulkan hardware capability viewer
mit-han-lab/qserve
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
mlc-ai/mlc-llm
Universal LLM Deployment Engine with ML Compilation
Mozilla-Ocho/llamafile
Distribute and run LLMs with a single file.
johnma2006/mamba-minimal
Simple, minimal implementation of the Mamba SSM in one file of PyTorch.
Syllo/nvtop
GPU & Accelerator process monitoring for AMD, Apple, Huawei, Intel, NVIDIA and Qualcomm
FMInference/DejaVu
Dao-AILab/flash-attention
Fast and memory-efficient exact attention
NVIDIA/FasterTransformer
Transformer-related optimization, including BERT and GPT
SJTU-IPADS/PowerInfer
High-speed Large Language Model Serving on PCs with Consumer-grade GPUs
lutzroeder/netron
Visualizer for neural network, deep learning and machine learning models
edufgf/CodeOptimizationTechniques
Implementation and benchmark of optimization techniques and algorithms applied to the Matrix Multiplication problem on a CPU/GPU multithreaded environment.
flame/blislab
BLISlab: A Sandbox for Optimizing GEMM
ggerganov/ggml
Tensor library for machine learning
alibaba/MNN
MNN is a blazing fast, lightweight deep learning framework, battle-tested by business-critical use cases in Alibaba
facebookresearch/seamless_communication
Foundational Models for State-of-the-Art Speech and Text Translation
pytorch-labs/gpt-fast
Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.
flowtyone/flowty-realtime-lcm-canvas
A real-time sketch-to-image demo using LCM and the Gradio library.
CNugteren/CLBlast
Tuned OpenCL BLAS
ermig1979/Simd
C++ image processing and machine learning library using SIMD: SSE, AVX, AVX-512, AMX for x86/x64, VMX (AltiVec) and VSX (Power7) for PowerPC, NEON for ARM.
chenzomi12/AISystem
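AISystem mainly refers to AI systems, covering full-stack low-level AI technologies such as AI chips, AI compilers, and AI inference and training frameworks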