quantization
There are 734 repositories under the quantization topic.
hiyouga/LLaMA-Factory
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
ymcui/Chinese-LLaMA-Alpaca
Chinese LLaMA & Alpaca large language models + local CPU/GPU training and deployment (Chinese LLaMA & Alpaca LLMs)
SYSTRAN/faster-whisper
Faster Whisper transcription with CTranslate2
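As a brief illustration of how this library exposes CTranslate2's quantized inference, here is a minimal sketch; the model size, device settings, and the audio file path are assumptions for illustration, not taken from the repository description.

```python
# Minimal faster-whisper sketch: transcribe an audio file with an
# INT8-quantized Whisper model running on CPU.
from faster_whisper import WhisperModel

# "small" model with INT8 compute type; "audio.mp3" is a placeholder path.
model = WhisperModel("small", device="cpu", compute_type="int8")

segments, info = model.transcribe("audio.mp3", beam_size=5)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```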
UFund-Me/Qbot
[🔥 updating ...] AI-powered automated quantitative trading bot (fully local deployment). AI-powered Quantitative Investment Research Platform. 📃 online docs: https://ufund-me.github.io/Qbot ✨ 📰 qbot-mini: https://github.com/Charmve/iQuant
bitsandbytes-foundation/bitsandbytes
Accessible large language models via k-bit quantization for PyTorch.
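In practice bitsandbytes is often used indirectly through 🤗 Transformers' `BitsAndBytesConfig`. The sketch below loads a causal LM with 4-bit NF4 weights; the model ID and dtype choices are assumptions for illustration.

```python
# Sketch: load a causal LM with 4-bit NF4 quantization via bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder; any HF causal LM works

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # run matmuls in bf16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

inputs = tokenizer("Quantization reduces memory because", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=30)[0]))
```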
kornelski/pngquant
Lossy PNG compressor — pngquant command based on libimagequant library
AutoGPTQ/AutoGPTQ
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
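A rough sketch of AutoGPTQ's quantize-and-save flow, assuming a small placeholder model, a single calibration sentence, and an arbitrary output directory.

```python
# Sketch: 4-bit GPTQ quantization of a causal LM with AutoGPTQ.
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from transformers import AutoTokenizer

model_id = "facebook/opt-125m"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

# Calibration samples: tokenized text with input_ids / attention_mask.
examples = [tokenizer("AutoGPTQ calibrates quantization on sample text.", return_tensors="pt")]

model.quantize(examples)
model.save_quantized("opt-125m-4bit-gptq")  # placeholder output directory
```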
IntelLabs/distiller
Neural Network Distiller by Intel AI Lab: a Python package for neural network compression research. https://intellabs.github.io/distiller
OpenNMT/CTranslate2
Fast inference engine for Transformer models
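A minimal sketch of running an INT8 CTranslate2 model; the model directory and the token sequence are placeholders, and the model is assumed to have already been converted to CTranslate2 format with one of the project's converter tools.

```python
# Sketch: batched translation with an INT8-quantized CTranslate2 model.
# "ct2_model_dir" must point at an already-converted model; the tokens
# below assume a SentencePiece-style vocabulary.
import ctranslate2

translator = ctranslate2.Translator("ct2_model_dir", device="cpu", compute_type="int8")

source_tokens = [["▁Hello", "▁world", "!"]]
results = translator.translate_batch(source_tokens, beam_size=4)
print(results[0].hypotheses[0])  # best hypothesis as a list of target tokens
```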
neuralmagic/deepsparse
Sparsity-aware deep learning inference runtime for CPUs
huawei-noah/Pretrained-Language-Model
Pretrained language models and related optimization techniques developed by Huawei Noah's Ark Lab.
IntelLabs/nlp-architect
A model library for exploring state-of-the-art deep learning topologies and techniques for optimizing Natural Language Processing neural networks
huggingface/optimum
🚀 Accelerate inference and training of 🤗 Transformers, Diffusers, TIMM and Sentence Transformers with easy to use hardware optimization tools
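As an illustration of that workflow, the sketch below exports a Transformers model to ONNX Runtime through Optimum and applies dynamic INT8 quantization; the model ID, quantization preset, and save directory are assumptions for illustration.

```python
# Sketch: export a model to ONNX via Optimum, then quantize it dynamically.
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

model_id = "distilbert-base-uncased-finetuned-sst-2-english"  # placeholder

# Export the PyTorch checkpoint to ONNX on the fly.
ort_model = ORTModelForSequenceClassification.from_pretrained(model_id, export=True)

# Dynamic (weight-only) INT8 quantization targeting AVX2 CPUs.
qconfig = AutoQuantizationConfig.avx2(is_static=False, per_channel=False)
quantizer = ORTQuantizer.from_pretrained(ort_model)
quantizer.quantize(save_dir="distilbert-sst2-int8", quantization_config=qconfig)
```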
aaron-xichen/pytorch-playground
Base pretrained models and datasets in pytorch (MNIST, SVHN, CIFAR10, CIFAR100, STL10, AlexNet, VGG16, VGG19, ResNet, Inception, SqueezeNet)
stochasticai/xTuring
Build, customize and control your own LLMs. From data pre-processing to fine-tuning, xTuring provides an easy way to personalize open-source LLMs. Join our discord community: https://discord.gg/TgHXuSJEk6
intel/neural-compressor
SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime
dvmazur/mixtral-offloading
Run Mixtral-8x7B models in Colab or consumer desktops
quic/aimet
AIMET is a library that provides advanced quantization and compression techniques for trained neural network models.
666DZY666/micronet
micronet, a model compression and deployment library. Compression: (1) quantization: quantization-aware training (QAT), high-bit (>2b) (DoReFa / "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference") and low-bit (≤2b) / ternary and binary (TWN/BNN/XNOR-Net); post-training quantization (PTQ), 8-bit (TensorRT); (2) pruning: normal, regular, and group convolutional channel pruning; (3) group convolution structure; (4) batch-normalization fusion for quantization. Deployment: TensorRT, FP32/FP16/INT8 (PTQ calibration), op adaptation (upsample), dynamic shape.
Efficient-ML/Awesome-Model-Quantization
A list of papers, docs, and code about model quantization. This repo aims to provide information for model quantization research and is continuously being improved. Pull requests adding works (papers, repositories) missing from the repo are welcome.
pytorch/ao
PyTorch native quantization and sparsity for training and inference
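A minimal torchao sketch applying weight-only INT8 quantization to an existing `nn.Module`; the toy model is a placeholder, and the exact entry points have shifted somewhat across torchao releases.

```python
# Sketch: weight-only INT8 quantization of a small model with torchao.
# API details vary by torchao version; this follows the quantize_ pattern.
import torch
import torch.nn as nn
from torchao.quantization import quantize_, int8_weight_only

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).eval()

# Swap Linear weights for INT8 weight-only quantized versions in place.
quantize_(model, int8_weight_only())

with torch.no_grad():
    out = model(torch.randn(1, 512))
print(out.shape)
```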
intel/intel-extension-for-pytorch
A Python package that extends official PyTorch to easily obtain better performance on Intel platforms.
OpenPPL/ppq
PPL Quantization Tool (PPQ) is a powerful offline neural network quantization tool.
PaddlePaddle/PaddleSlim
PaddleSlim is an open-source library for deep model compression and architecture search.
open-mmlab/mmrazor
OpenMMLab Model Compression Toolbox and Benchmark.
tensorflow/model-optimization
A toolkit to optimize ML models for deployment for Keras and TensorFlow, including quantization and pruning.
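A small sketch of the toolkit's Keras quantization-aware training path; the toy model, its shapes, and the training setup are placeholders.

```python
# Sketch: quantization-aware training of a tiny Keras model with TF-MOT.
import tensorflow as tf
import tensorflow_model_optimization as tfmot

base_model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(10),
])

# Wrap the model with fake-quantization ops for QAT.
qat_model = tfmot.quantization.keras.quantize_model(base_model)
qat_model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)
qat_model.summary()
# qat_model.fit(x_train, y_train, epochs=1)  # then train as usual on your data
```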
RWKV/rwkv.cpp
INT4/INT5/INT8 and FP16 inference on CPU for RWKV language model
Xilinx/brevitas
Brevitas: neural network quantization in PyTorch
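A brief sketch of defining a low-bit layer stack with Brevitas quantized modules; the bit widths and layer sizes are arbitrary choices for illustration.

```python
# Sketch: a tiny 4-bit quantized convolutional block built with Brevitas.
import torch
import torch.nn as nn
from brevitas.nn import QuantConv2d, QuantReLU

block = nn.Sequential(
    QuantConv2d(3, 16, kernel_size=3, padding=1, weight_bit_width=4),
    QuantReLU(bit_width=4),
    QuantConv2d(16, 16, kernel_size=3, padding=1, weight_bit_width=4),
)

out = block(torch.randn(1, 3, 32, 32))
print(out.shape)  # torch.Size([1, 16, 32, 32])
```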
RahulSChand/gpu_poor
Calculate tokens/s & GPU memory requirements for any LLM. Supports llama.cpp/ggml/bnb/QLoRA quantization
huawei-noah/Efficient-Computing
Efficient computing methods developed by Huawei Noah's Ark Lab
thu-ml/SageAttention
Quantized Attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
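The repository positions this as a drop-in attention kernel; below is a minimal sketch of calling it on random tensors. The shapes, dtype, and the CUDA requirement are assumptions based on typical FlashAttention-style layouts, not specifics from the description above.

```python
# Sketch: calling the quantized attention kernel on random Q/K/V tensors.
# Assumes a CUDA GPU and half-precision inputs in (batch, heads, seq, dim) layout.
import torch
from sageattention import sageattn

q = torch.randn(1, 8, 1024, 64, dtype=torch.float16, device="cuda")
k = torch.randn(1, 8, 1024, 64, dtype=torch.float16, device="cuda")
v = torch.randn(1, 8, 1024, 64, dtype=torch.float16, device="cuda")

out = sageattn(q, k, v, is_causal=False)  # same shape as q
print(out.shape)
```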
openvinotoolkit/training_extensions
Train, Evaluate, Optimize, Deploy Computer Vision Models via OpenVINO™
vllm-project/llm-compressor
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
mit-han-lab/nunchaku
[ICLR2025 Spotlight] SVDQuant: Absorbing Outliers by Low-Rank Components for 4-Bit Diffusion Models
openvinotoolkit/nncf
Neural Network Compression Framework for enhanced OpenVINO™ inference
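A compact sketch of NNCF's post-training quantization entry point; the model, the random calibration data, and the identity transform are placeholders for illustration.

```python
# Sketch: post-training INT8 quantization of a torchvision model with NNCF.
# The calibration data here is random; in practice use a real validation set.
import torch
import nncf
from torchvision.models import resnet18

model = resnet18(weights=None).eval()

# NNCF wraps any iterable; the second argument maps one item to model inputs.
calibration_data = [torch.randn(1, 3, 224, 224) for _ in range(10)]
calibration_dataset = nncf.Dataset(calibration_data, lambda x: x)

quantized_model = nncf.quantize(model, calibration_dataset)
```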
adithya-s-k/AI-Engineering.academy
Mastering Applied AI, One Concept at a Time