JIANGJZ's Stars
ggerganov/llama.cpp
LLM inference in C/C++
microsoft/DeepSpeed
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
mlc-ai/mlc-llm
Universal LLM Deployment Engine with ML Compilation
LiLittleCat/awesome-free-chatgpt
🆓 List of free ChatGPT mirror sites, continuously updated.
ml-explore/mlx
MLX: An array framework for Apple silicon
Dao-AILab/flash-attention
Fast and memory-efficient exact attention
liguodongiot/llm-action
This project shares the technical principles behind large language models along with hands-on practical experience.
mit-han-lab/streaming-llm
[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
pytorch-labs/gpt-fast
Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.
InternLM/lmdeploy
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
DefTruth/Awesome-LLM-Inference
📖 A curated list of awesome LLM inference papers with code: TensorRT-LLM, vLLM, StreamingLLM, AWQ, SmoothQuant, WINT8/4, continuous batching, FlashAttention, PagedAttention, etc.
FasterDecoding/Medusa
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
intel/intel-extension-for-transformers
⚡ Build your chatbot within minutes on your favorite device; offers SOTA compression techniques for LLMs; run LLMs efficiently on Intel platforms ⚡
S-LoRA/S-LoRA
S-LoRA: Serving Thousands of Concurrent LoRA Adapters
facebookincubator/submitit
Python 3.8+ toolbox for submitting jobs to Slurm
ray-project/ray-llm
RayLLM - LLMs on Ray
CNugteren/CLBlast
Tuned OpenCL BLAS
alibaba/rtp-llm
RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
hpcaitech/SwiftInfer
Efficient AI Inference & Serving
ROCm/composable_kernel
Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators
volcengine/veGiantModel
lambda7xx/awesome-AI-system
Papers and accompanying code for AI systems
CNugteren/myGEMM
Code appendix to an OpenCL matrix-multiplication tutorial
opengear-project/GEAR
GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLMs
eth-easl/orion
An interference-aware scheduler for fine-grained GPU sharing
amd/amd-lab-notes
AMD lab notes with code examples to demonstrate use of AMD GPUs
EmbeddedLLM/vllm-rocm
vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs
yhoshi3/RaLLe
RaLLe: A Framework for Developing and Evaluating Retrieval-Augmented Large Language Models
Xtra-Computing/hacc_demo