lauthu's Stars
vllm-project/vllm
A high-throughput and memory-efficient inference and serving engine for LLMs
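A minimal offline-inference sketch using vLLM's documented `LLM`/`SamplingParams` entry points (the model name is just an illustrative small model):

```python
from vllm import LLM, SamplingParams

# Load a model and generate with nucleus sampling; "facebook/opt-125m"
# is an arbitrary small example, not a recommendation.
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

for out in llm.generate(["Hello, my name is"], params):
    print(out.outputs[0].text)
```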
haotian-liu/LLaVA
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
huggingface/datasets
🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
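A quick sketch of the core `datasets` workflow (the dataset name is illustrative):

```python
from datasets import load_dataset

# Download a dataset from the Hub, then transform it with map/filter.
ds = load_dataset("imdb", split="train")
ds = ds.map(lambda ex: {"n_words": len(ex["text"].split())})
long_reviews = ds.filter(lambda ex: ex["n_words"] > 200)
print(len(long_reviews))
```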
NVIDIA/nvidia-docker
Build and run Docker containers leveraging NVIDIA GPUs
microsoft/LoRA
Code for loralib, an implementation of "LoRA: Low-Rank Adaptation of Large Language Models"
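A short sketch of the loralib pattern from the repo's README: swap `nn.Linear` for `lora.Linear`, then freeze everything except the low-rank adapters.

```python
import torch
import torch.nn as nn
import loralib as lora

# Toy model where one projection gets low-rank adapters (rank r=16).
model = nn.Sequential(
    lora.Linear(512, 512, r=16),  # LoRA-augmented layer
    nn.ReLU(),
    nn.Linear(512, 10),           # ordinary layer, frozen below
)

# Train only the low-rank A/B matrices; pretrained weights stay frozen.
lora.mark_only_lora_as_trainable(model)

# Checkpoint only the (small) LoRA parameters.
torch.save(lora.lora_state_dict(model), "lora_ckpt.pt")
```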
magic-research/magic-animate
[CVPR 2024] MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model
karpathy/micrograd
A tiny scalar-valued autograd engine and a neural net library on top of it with PyTorch-like API
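Micrograd's whole API fits in a few lines; this follows the README's pattern of building a scalar expression and backpropagating through it:

```python
from micrograd.engine import Value

a = Value(2.0)
b = Value(-3.0)
c = a * b + a.relu()   # build a tiny scalar computation graph
c.backward()           # reverse-mode autodiff

print(a.grad, b.grad)  # d(c)/d(a) = -2.0, d(c)/d(b) = 2.0
```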
mistralai/mistral-src
Reference implementation of Mistral AI 7B v0.1 model.
NVIDIA/TensorRT-LLM
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. TensorRT-LLM also contains components to create Python and C++ runtimes that execute those TensorRT engines.
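A hedged sketch of that Python API, assuming a recent TensorRT-LLM release with the high-level LLM entry point (engine building happens behind the call; the model name is illustrative):

```python
from tensorrt_llm import LLM, SamplingParams

# Builds (or loads) a TensorRT engine for the model, then runs inference.
llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)

for out in llm.generate(["The capital of France is"], params):
    print(out.outputs[0].text)
```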
Dhghomon/easy_rust
Rust explained using easy English
SJTU-IPADS/PowerInfer
High-speed Large Language Model Serving on PCs with Consumer-grade GPUs
skypilot-org/skypilot
SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
allenai/OLMo
Modeling, training, eval, and inference code for OLMo
IntelLabs/distiller
Neural Network Distiller by Intel AI Lab: a Python package for neural network compression research. https://intellabs.github.io/distiller
turboderp/exllamav2
A fast inference library for running LLMs locally on modern consumer-class GPUs
ModelTC/lightllm
LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance.
FasterDecoding/Medusa
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads
intel/neural-compressor
SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime
IST-DASLab/gptq
Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".
microsoft/Olive
Olive: Simplify ML Model Finetuning, Conversion, Quantization, and Optimization for CPUs, GPUs and NPUs.
ionelmc/pytest-benchmark
py.test fixture for benchmarking code
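Usage is a single fixture: `benchmark(fn, *args)` times the call repeatedly and attaches the statistics to the test report.

```python
# test_fib.py -- run with: pytest test_fib.py
def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

def test_fib_10(benchmark):
    # The benchmark fixture calls fib(10) many times and records timings.
    result = benchmark(fib, 10)
    assert result == 55
```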
mit-han-lab/smoothquant
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
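The core idea is a per-channel equivalence transform that migrates activation outliers into the weights. A minimal NumPy sketch of the paper's smoothing factor, s_j = max|X_j|^α / max|W_j|^(1−α):

```python
import numpy as np

def smooth(X, W, alpha=0.5):
    """X: activations (tokens, in); W: weights (out, in).
    Returns scaled (X_hat, W_hat) with the same product X @ W.T."""
    s = np.abs(X).max(0) ** alpha / np.abs(W).max(0) ** (1 - alpha)
    s = np.clip(s, 1e-5, None)  # guard against all-zero channels
    return X / s, W * s

X = np.random.randn(8, 4) * np.array([1.0, 50.0, 1.0, 1.0])  # outlier channel
W = np.random.randn(3, 4)
X_hat, W_hat = smooth(X, W)
assert np.allclose(X @ W.T, X_hat @ W_hat.T)  # mathematically equivalent
```

After smoothing, both X_hat and W_hat have flatter per-channel ranges, which is what makes plain INT8 quantization of both sides accurate.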
microsoft/onnxruntime-inference-examples
Examples for using ONNX Runtime for machine learning inferencing.
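The basic Python flow these examples build on is short; a minimal sketch (the model path and input shape are placeholders):

```python
import numpy as np
import onnxruntime as ort

# "model.onnx" is a placeholder; any exported model works the same way.
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name

x = np.random.randn(1, 3, 224, 224).astype(np.float32)  # example input
outputs = sess.run(None, {input_name: x})  # None = fetch all outputs
print(outputs[0].shape)
```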
triton-inference-server/tensorrtllm_backend
The Triton TensorRT-LLM Backend
Sunt-ing/database-system-readings
😋 A curated reading list about database systems
facebookresearch/LLM-QAT
Code repo for the paper "LLM-QAT: Data-Free Quantization Aware Training for Large Language Models"
transformer-vq/transformer_vq
openppl-public/ppl.nn.llm
ROCm/flash-attention
Fast and memory-efficient exact attention
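The library exposes a drop-in attention kernel; a sketch using `flash_attn_func`, assuming the ROCm fork mirrors the upstream API (tensor layout is (batch, seqlen, heads, head_dim) in half precision on the GPU):

```python
import torch
from flash_attn import flash_attn_func

# Example sizes; FlashAttention requires fp16/bf16 tensors on the device.
q = torch.randn(2, 1024, 16, 64, dtype=torch.float16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)

out = flash_attn_func(q, k, v, causal=True)  # same shape as q
```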
Syencil/Programming_Massively_Parallel_Processors
Code and notes for the six major CUDA parallel computing patterns