MARD1NO's Stars
radarFudan/Awesome-state-space-models
Collection of papers on state-space models
hahnyuan/ASVD4LLM
Activation-aware Singular Value Decomposition for Compressing Large Language Models
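ASVD compresses a weight matrix into a low-rank pair, weighting the decomposition by activation statistics so the approximation error falls on dimensions where activations are small. A rough NumPy sketch of one diagonal-scaling variant (a simplification for illustration, not the repo's exact method; the function name is mine):

```python
import numpy as np

def asvd_compress(W, act_scale, rank):
    """Low-rank factorization of W weighted by per-input activation magnitudes.

    W: [out, in] weight matrix; act_scale: [in] typical activation magnitude
    per input dimension (assumed positive). Returns A [out, rank], B [rank, in]
    such that A @ B approximates W, with error concentrated on low-activation dims.
    """
    scaled = W * act_scale                 # scale columns by activation magnitude
    U, sigma, Vt = np.linalg.svd(scaled, full_matrices=False)
    A = U[:, :rank] * sigma[:rank]         # absorb singular values into A
    B = Vt[:rank] / act_scale              # fold the diagonal scaling back out
    return A, B
```

With `rank` equal to the full rank the factorization is exact; smaller ranks trade accuracy for a `(out + in) * rank` parameter count.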
sneaxiy/AAdiffTools
microsoft/superbenchmark
A validation and profiling tool for AI infrastructure
pytorch-labs/gpt-fast
Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.
Dao-AILab/causal-conv1d
Causal depthwise conv1d in CUDA, with a PyTorch interface
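A causal depthwise conv1d applies an independent filter to each channel, left-padded so every output step depends only on current and past inputs. A minimal pure-Python reference of that behavior (correlation form, no kernel flipping; names are illustrative, not the repo's API):

```python
def causal_depthwise_conv1d(x, weight):
    """x: list of per-channel sequences; weight: one kernel per channel.

    Left-pads each sequence with zeros so output[t] only sees inputs <= t.
    """
    out = []
    for seq, w in zip(x, weight):
        k = len(w)
        padded = [0.0] * (k - 1) + list(seq)   # causal left padding
        out.append([sum(w[j] * padded[t + j] for j in range(k))
                    for t in range(len(seq))])
    return out
```

The CUDA kernel computes the same thing per channel, just fused and parallelized; depthwise means no cross-channel mixing.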
bobby-he/simplified_transformers
MooreThreads/MobiMaliangSDK
microsoft/mscclpp
MSCCL++: A GPU-driven communication stack for scalable AI applications
Dao-AILab/fast-hadamard-transform
Fast Hadamard transform in CUDA, with a PyTorch interface
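The Hadamard transform multiplies a length-n vector by the ±1-valued Hadamard matrix, and the "fast" version does it in O(n log n) with FFT-style butterflies. A pure-Python reference of the same transform (unnormalized, natural/Sylvester ordering; n must be a power of two):

```python
def fht(x):
    """Unnormalized fast Hadamard transform via in-place butterflies."""
    x = list(x)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, h * 2):           # each block of size 2h
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b  # butterfly: sum and difference
        h *= 2
    return x
```

Applying it twice recovers the input scaled by n, since H @ H = n * I.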
OscarXZQ/weight-selection
gusye1234/chat-spot
A Spotlight-style app: talk to ChatGPT and snip anything to it, right at your fingertips
AILab-CVC/UniRepLKNet
[CVPR'24] UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition
hao-ai-lab/LookaheadDecoding
reed-lau/cute-gemm
AILab-CVC/GroupMixFormer
GroupMixAttention and GroupMixFormer
excalidraw/excalidraw
Virtual whiteboard for sketching hand-drawn like diagrams
flashinfer-ai/flashinfer
FlashInfer: Kernel Library for LLM Serving
S-LoRA/S-LoRA
S-LoRA: Serving Thousands of Concurrent LoRA Adapters
DeepLangAI/LingoWhale-8B
LingoWhale-8B: Open Bilingual LLMs (open-source bilingual pretrained large language models)
deepseek-ai/DeepSeek-Coder
DeepSeek Coder: Let the Code Write Itself
mit-han-lab/spatten-llm
[HPCA'21] SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning
bojone/FSQ
Keras implementation of Finite Scalar Quantization
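Finite Scalar Quantization (FSQ) replaces learned VQ codebooks by rounding each latent dimension to a small fixed grid of values. A minimal NumPy sketch of the core bound-then-round step (odd level counts assumed for simplicity; the function name is mine, and the straight-through gradient trick used in training is omitted):

```python
import numpy as np

def fsq_quantize(z, levels):
    """Quantize each dimension of z to a fixed grid of `levels[i]` values."""
    half = (np.asarray(levels) - 1) / 2   # e.g. 5 levels -> half = 2
    bounded = np.tanh(z) * half           # squash each dim into [-half, half]
    return np.round(bounded)              # snap to the integer grid
```

The implicit codebook is the Cartesian product of the per-dimension grids, so e.g. `levels = [5, 5, 5]` yields 125 codes with no codebook parameters to learn.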
Mq-b/Loser-HomeWork
Homework showcase from the "losers" (community members), with answer walkthroughs and some C++ knowledge
hahnyuan/TorchQuantExtension
PyTorch extension for quantization with highly efficient CUDA kernels
THUDM/ChatGLM3
ChatGLM3 series: Open Bilingual Chat LLMs (open-source bilingual dialogue language models)
google/maxtext
A simple, performant, and scalable JAX LLM!
Delgan/loguru
Python logging made (stupidly) simple
triton-inference-server/tensorrtllm_backend
The Triton TensorRT-LLM Backend
NVIDIA/TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.