chu-tianxiang's Stars
xai-org/grok-1
Grok open release
hpcaitech/Open-Sora
Open-Sora: Democratizing Efficient Video Production for All
LargeWorldModel/LWM
Large World Model With 1M Context
HVision-NKU/StoryDiffusion
Accepted as a NeurIPS 2024 spotlight presentation paper
pytorch-labs/gpt-fast
Simple and efficient PyTorch-native transformer text generation in <1000 LOC of Python.
meta-llama/llama-agentic-system
Agentic components of the Llama Stack APIs
DefTruth/Awesome-LLM-Inference
📖 A curated list of awesome LLM inference papers with code, covering TensorRT-LLM, vLLM, streaming-llm, AWQ, SmoothQuant, WINT8/4, continuous batching, FlashAttention, PagedAttention, etc.
pytorch/torchtitan
A native PyTorch Library for large model training
predibase/lorax
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
baaivision/Emu
Emu Series: Generative Multimodal Models from BAAI
BBuf/how-to-optim-algorithm-in-cuda
How to optimize various algorithms in CUDA.
noamgat/lm-format-enforcer
Enforce the output format (JSON Schema, Regex etc) of a language model
deepseek-ai/DeepSeek-LLM
DeepSeek LLM: Let there be answers
flashinfer-ai/flashinfer
FlashInfer: Kernel Library for LLM Serving
NVIDIA/cccl
CUDA Core Compute Libraries
Vahe1994/AQLM
Official PyTorch repository for "Extreme Compression of Large Language Models via Additive Quantization" (https://arxiv.org/abs/2401.06118) and "PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression" (https://arxiv.org/abs/2405.14852)
google-research/deduplicate-text-datasets
kvcache-ai/Mooncake
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
zhuzilin/ring-flash-attention
Ring attention implementation with flash attention
Cornell-RelaxML/quip-sharp
mit-han-lab/qserve
QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving
LLMServe/DistServe
Disaggregated serving system for Large Language Models (LLMs).
Bruce-Lee-LY/cuda_hgemm
Several optimization methods for half-precision general matrix multiplication (HGEMM) using tensor cores, with the WMMA API and MMA PTX instructions.
KnowingNothing/MatmulTutorial
An easy-to-understand TensorOp Matmul tutorial
efeslab/Atom
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
spcl/QuaRot
Code for QuaRot, an end-to-end 4-bit inference scheme for large language models.
AlibabaResearch/flash-llm
Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity
AILab-CVC/VL-GPT
VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation
chu-tianxiang/llama-cpp-torch
llama.cpp to PyTorch Converter
Superjomn/cuda-from-scratch