JIANGJZ's Stars
ggerganov/llama.cpp
LLM inference in C/C++
microsoft/DeepSpeed
DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
mlc-ai/mlc-llm
Universal LLM Deployment Engine with ML Compilation
LiLittleCat/awesome-free-chatgpt
🆓 List of free ChatGPT mirror sites, continuously updated.
ml-explore/mlx
MLX: An array framework for Apple silicon
Dao-AILab/flash-attention
Fast and memory-efficient exact attention
liguodongiot/llm-action
This project shares the technical principles behind large language models along with hands-on practical experience.
mit-han-lab/streaming-llm
[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
pytorch-labs/gpt-fast
Simple and efficient pytorch-native transformer text generation in <1000 LOC of python.
InternLM/lmdeploy
LMDeploy is a toolkit for compressing, deploying, and serving LLMs.
DefTruth/Awesome-LLM-Inference
📖 A curated list of awesome LLM inference papers with code: TensorRT-LLM, vLLM, StreamingLLM, AWQ, SmoothQuant, WINT8/4, continuous batching, FlashAttention, PagedAttention, etc.
FasterDecoding/Medusa
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
intel/intel-extension-for-transformers
⚡ Build your chatbot within minutes on your favorite device; offers SOTA compression techniques for LLMs; run LLMs efficiently on Intel platforms ⚡
S-LoRA/S-LoRA
S-LoRA: Serving Thousands of Concurrent LoRA Adapters
facebookincubator/submitit
Python 3.8+ toolbox for submitting jobs to Slurm
ray-project/ray-llm
RayLLM - LLMs on Ray
CNugteren/CLBlast
Tuned OpenCL BLAS
alibaba/rtp-llm
RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
hpcaitech/SwiftInfer
Efficient AI Inference & Serving
ROCm/composable_kernel
Composable Kernel: Performance Portable Programming Model for Machine Learning Tensor Operators
volcengine/veGiantModel
lambda7xx/awesome-AI-system
Papers and accompanying code for AI systems
CNugteren/myGEMM
Code appendix to an OpenCL matrix-multiplication tutorial
opengear-project/GEAR
GEAR: An Efficient KV Cache Compression Recipe for Near-Lossless Generative Inference of LLMs
eth-easl/orion
An interference-aware scheduler for fine-grained GPU sharing
amd/amd-lab-notes
AMD lab notes with code examples to demonstrate use of AMD GPUs
EmbeddedLLM/vllm-rocm
vLLM: A high-throughput and memory-efficient inference and serving engine for LLMs
yhoshi3/RaLLe
RaLLe: A Framework for Developing and Evaluating Retrieval-Augmented Large Language Models
Xtra-Computing/hacc_demo