MARD1NO's Stars
excalidraw/excalidraw
Virtual whiteboard for sketching hand-drawn-style diagrams
Delgan/loguru
Python logging made (stupidly) simple
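A minimal sketch of what that tagline means in practice, using loguru calls that exist in the library (the sink path, rotation size, and function are arbitrary examples):

```python
from loguru import logger

# Works with zero configuration: no handlers, formatters, or levels to wire up.
logger.info("Hello, loguru!")

# One add() call attaches a rotating file sink (path and size are arbitrary).
logger.add("app_{time}.log", rotation="500 MB", level="DEBUG")

# catch() logs any uncaught exception with a full traceback.
@logger.catch
def divide(a: int, b: int) -> float:
    return a / b

divide(1, 0)  # logged as an error instead of crashing silently
```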
THUDM/ChatGLM3
ChatGLM3 series: open-source bilingual chat LLMs
NVIDIA/TensorRT-LLM
TensorRT-LLM provides an easy-to-use Python API for defining Large Language Models (LLMs) and building TensorRT engines with state-of-the-art optimizations for efficient inference on NVIDIA GPUs. It also includes components for creating Python and C++ runtimes that execute those engines.
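A hedged sketch of the flow that description implies, assuming a recent TensorRT-LLM release that ships the high-level `LLM` Python API (the model id and prompt below are placeholders, not from the repo):

```python
from tensorrt_llm import LLM, SamplingParams  # assumes a release exposing the LLM API

# Engine building happens under the hood when the model is first loaded.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")  # placeholder model id

params = SamplingParams(temperature=0.8, max_tokens=64)
for output in llm.generate(["What does a TensorRT engine contain?"], params):
    print(output.outputs[0].text)
```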
deepseek-ai/DeepSeek-Coder
DeepSeek Coder: Let the Code Write Itself
pytorch-labs/gpt-fast
Simple and efficient PyTorch-native transformer text generation in under 1,000 lines of Python.
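Not the repo's code, but a minimal sketch of the technique it streamlines: a pure-PyTorch greedy decoding loop over any causal LM that returns logits of shape [batch, seq, vocab] (`model` is a stand-in):

```python
import torch

@torch.no_grad()
def greedy_generate(model, input_ids: torch.Tensor, max_new_tokens: int) -> torch.Tensor:
    """input_ids: [B, T] token ids; appends max_new_tokens greedily chosen tokens."""
    for _ in range(max_new_tokens):
        logits = model(input_ids)                                 # [B, T, V]
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # most likely next token
        input_ids = torch.cat([input_ids, next_id], dim=1)        # grow the sequence
    return input_ids
```

gpt-fast's speed comes from layering KV caching, compilation, and quantization on top of exactly this loop.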
S-LoRA/S-LoRA
S-LoRA: Serving Thousands of Concurrent LoRA Adapters
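The primitive being served is a low-rank update per request on top of one shared base weight. A naive per-request sketch (all names are illustrative; the repo's actual contribution is batching and paging thousands of these adapters efficiently, which this loop does not capture):

```python
import torch

def lora_linear(x, W, adapters, adapter_ids, scaling=1.0):
    """x: [B, d_in]; W: [d_in, d_out] shared base weight;
    adapters: dict id -> (A [d_in, r], B [r, d_out]); adapter_ids: B ids, one per request."""
    y = x @ W  # one shared base projection for the whole batch
    for i, aid in enumerate(adapter_ids):
        A, B = adapters[aid]                    # low-rank factors for request i
        y[i] = y[i] + scaling * (x[i] @ A @ B)  # per-request low-rank update
    return y
```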
google/maxtext
A simple, performant, and scalable JAX LLM!
hao-ai-lab/LookaheadDecoding
chengzeyi/stable-fast
An inference performance optimization framework for Hugging Face Diffusers on NVIDIA GPUs.
AILab-CVC/UniRepLKNet
[CVPR'24] UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio, Video, Point Cloud, Time-Series and Image Recognition
flashinfer-ai/flashinfer
FlashInfer: Kernel Library for LLM Serving
Mq-b/Loser-HomeWork
Homework showcases from the "losers", with answer explanations and some C++ knowledge
triton-inference-server/tensorrtllm_backend
The Triton TensorRT-LLM Backend
bobby-he/simplified_transformers
Bruce-Lee-LY/cuda_hgemm
Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores, via the WMMA API and MMA PTX instructions.
Dao-AILab/causal-conv1d
Causal depthwise conv1d in CUDA, with a PyTorch interface
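A pure-PyTorch reference for the operation the CUDA kernel accelerates: depthwise (groups = channels) conv1d made causal by padding only on the left, so no output position sees future timesteps:

```python
import torch
import torch.nn.functional as F

def causal_depthwise_conv1d(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    """x: [B, C, T]; weight: [C, 1, K], one filter per channel (groups=C)."""
    K = weight.shape[-1]
    x = F.pad(x, (K - 1, 0))                      # pad the past only, never the future
    return F.conv1d(x, weight, groups=x.shape[1])  # output stays [B, C, T]
```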
OscarXZQ/weight-selection
microsoft/mscclpp
MSCCL++: A GPU-driven communication stack for scalable AI applications
DeepLangAI/LingoWhale-8B
LingoWhale-8B: an open-source bilingual pretrained large language model
MooreThreads/MobiMaliangSDK
AILab-CVC/GroupMixFormer
GroupMixAttention and GroupMixFormer
Bruce-Lee-LY/flash_attention_inference
Performance of the C++ interfaces of FlashAttention, FlashAttention-2, and a self-implemented quantized decoding attention in large language model (LLM) inference scenarios.
Dao-AILab/fast-hadamard-transform
Fast Hadamard transform in CUDA, with a PyTorch interface
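A plain-PyTorch reference of the O(n log n) butterfly recursion that the CUDA kernel fuses (unnormalized; the transform length must be a power of two):

```python
import torch

def hadamard_transform(x: torch.Tensor) -> torch.Tensor:
    """x: [..., n] with n a power of two; unnormalized FWHT along the last dim."""
    batch, n = x.shape[:-1], x.shape[-1]
    h = 1
    while h < n:
        y = x.reshape(*batch, n // (2 * h), 2, h)      # pair up blocks h apart
        a, b = y[..., 0, :], y[..., 1, :]
        x = torch.stack((a + b, a - b), dim=-2).reshape(*batch, n)  # butterfly step
        h *= 2
    return x
```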
mit-han-lab/spatten-llm
[HPCA'21] SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning
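The cascade token-pruning idea, sketched in PyTorch: rank keys by accumulated attention probability and keep only the top fraction. This is illustrative only; the paper's contribution is a hardware architecture for this, and `keep_ratio` is a made-up knob:

```python
import torch

def prune_tokens(attn: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """attn: [H, T_q, T_k] attention probs; returns indices of key tokens to keep."""
    importance = attn.sum(dim=(0, 1))                 # cumulative score per key token
    k = max(1, int(attn.shape[-1] * keep_ratio))      # how many tokens survive
    return importance.topk(k).indices.sort().values   # kept indices, in order
```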
bojone/FSQ
Keras implementation of Finite Scalar Quantization (FSQ)
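The repo is Keras, but the core of FSQ fits in a few lines of any framework: bound each latent dimension, round it to a small fixed number of levels, and pass gradients straight through the rounding. A PyTorch sketch with a single shared `levels` (the paper uses a per-dimension list):

```python
import torch

def fsq(z: torch.Tensor, levels: int = 5) -> torch.Tensor:
    """z: [..., d]; quantizes each dimension to `levels` values in [-1, 1]."""
    half = (levels - 1) / 2
    z = torch.tanh(z) * half        # bound each dim to [-half, half]
    z_q = torch.round(z)            # snap to one of `levels` integer codes
    z = z + (z_q - z).detach()      # straight-through estimator for gradients
    return z / half                 # rescale codes back to [-1, 1]
```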
reed-lau/cute-gemm
gusye1234/chat-spot
A Spotlight-style app: talk to ChatGPT or snip anything to it, right at your fingertips
tridao/cutlass_quant
Playing with quantization
hahnyuan/TorchQuantExtension
PyTorch extension for quantization with highly efficient CUDA kernels
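For context, a plain-PyTorch reference of the kind of op such kernels accelerate: symmetric per-tensor int8 quantize/dequantize (generic, not this extension's actual API):

```python
import torch

def quantize_int8(x: torch.Tensor):
    """Returns (int8 tensor, scale) for a symmetric per-tensor scheme."""
    scale = x.abs().max().clamp(min=1e-8) / 127.0            # map max |x| to 127
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale                       # approximate reconstruction
```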