inference-optimization
There are 41 repositories under the inference-optimization topic.
google/XNNPACK
High-efficiency floating-point neural network inference operators for mobile, server, and Web
alibaba/BladeDISC
BladeDISC is an end-to-end DynamIc Shape Compiler project for machine learning workloads.
jiazhihao/TASO
The Tensor Algebra SuperOptimizer for Deep Learning
mit-han-lab/inter-operator-scheduler
[MLSys 2021] IOS: Inter-Operator Scheduler for CNN Acceleration
imedslab/pytorch_bn_fusion
Batch-normalization fusion for PyTorch. This repository is archived and no longer maintained.
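For context: BN fusion folds a BatchNorm layer's statistics and affine transform into the preceding convolution's weights, so inference skips the normalization entirely. A minimal PyTorch sketch of the idea (not this repo's code; `fuse_conv_bn` is a hypothetical helper):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold BatchNorm statistics into the preceding convolution."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels,
                      conv.kernel_size, conv.stride, conv.padding,
                      conv.dilation, conv.groups, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)  # gamma / sigma
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.copy_((bias - bn.running_mean) * scale + bn.bias)
    return fused

# Sanity check: the fused conv matches conv -> bn in eval mode.
conv, bn = nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8)
bn.eval()
x = torch.randn(1, 3, 16, 16)
assert torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-4)
```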
ZFTurbo/Keras-inference-time-optimizer
Optimize the layer structure of Keras models to reduce computation time
Rapternmn/PyTorch-Onnx-Tensorrt
A set of tools to make your life easier with TensorRT and ONNX Runtime. This repo is designed for YOLOv3.
BaiTheBest/SparseLLM
Official Repo for SparseLLM: Global Pruning of LLMs (NeurIPS 2024)
keli-wen/AGI-Study
Blog posts, reading reports, and code examples on AGI/LLM-related topics.
lmaxwell/Armednn
cross-platform modular neural network inference library, small and efficient
ksm26/Efficiently-Serving-LLMs
Learn the ins and outs of efficiently serving large language models (LLMs). Dive into optimization techniques, including KV caching and Low-Rank Adapters (LoRA), and gain hands-on experience with Predibase's LoRAX inference server.
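KV caching stores each layer's attention keys and values so that every decode step only processes the newest token instead of re-encoding the whole prefix. A minimal greedy-decoding sketch against the Hugging Face transformers API (gpt2 chosen here only as a small public model):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("The key-value cache lets us", return_tensors="pt").input_ids
past = None
with torch.no_grad():
    for _ in range(20):
        # With a cache, each step feeds only the newest token; attention
        # reuses the stored K/V tensors for all earlier positions.
        out = model(ids if past is None else ids[:, -1:],
                    past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1].argmax(-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
print(tok.decode(ids[0]))
```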
Harly-1506/Faster-Inference-yolov8
Faster YOLOv8 inference: optimize and export YOLOv8 models for faster inference using OpenVINO and NumPy 🔢
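For reference, the ultralytics package exposes the OpenVINO export path directly; a short sketch (the checkpoint name, `half=True`, and the test image are illustrative, not this repo's exact workflow):

```python
from ultralytics import YOLO  # assumes the ultralytics package is installed

model = YOLO("yolov8n.pt")                         # small pretrained checkpoint
path = model.export(format="openvino", half=True)  # writes an OpenVINO IR folder
ov_model = YOLO(path)                              # the exported IR loads back directly
results = ov_model("https://ultralytics.com/images/bus.jpg")
```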
vbdi/divprune
[CVPR 2025] DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models
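Diversity-based token pruning keeps a subset of visual tokens that covers the embedding space, rather than just the highest-scoring ones. A generic greedy max-min (farthest-point) selection sketch in PyTorch, as one illustration of the idea rather than DivPrune's exact criterion:

```python
import torch
import torch.nn.functional as F

def diverse_token_subset(tokens, keep):
    # Greedy max-min selection: repeatedly pick the token least similar
    # to everything chosen so far, so the kept set stays diverse.
    x = F.normalize(tokens, dim=-1)
    sim = x @ x.T                                   # pairwise cosine similarity
    chosen = [int(sim.sum(1).argmin())]             # seed with the most "isolated" token
    for _ in range(keep - 1):
        max_sim = sim[:, chosen].max(dim=1).values  # similarity to the chosen set
        max_sim[chosen] = float("inf")              # never re-pick a token
        chosen.append(int(max_sim.argmin()))
    return tokens[chosen]

vis = torch.randn(576, 1024)                 # e.g. ViT patch tokens for one image
pruned = diverse_token_subset(vis, keep=64)  # keep a diverse 64-token subset
print(pruned.shape)                          # torch.Size([64, 1024])
```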
grazder/template.cpp
A template for getting started writing code using GGML
amazon-science/llm-rank-pruning
LLM-Rank: a graph-theoretical approach to structured pruning of large language models based on weighted PageRank centrality, as introduced in the accompanying paper.
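The underlying idea: treat units as nodes in a directed graph whose edge weights come from |W|, score nodes by weighted PageRank, and prune the least central ones. A toy NumPy sketch using plain power iteration over a tiny MLP graph (an illustration of the concept, not the paper's exact formulation):

```python
import numpy as np

def weighted_pagerank(adj, d=0.85, iters=100):
    # adj[i, j] = strength of the edge i -> j (here: |weight| between units)
    deg = adj.sum(axis=1, keepdims=True)
    P = np.divide(adj, deg, out=np.zeros_like(adj), where=deg > 0)  # row-stochastic
    n = adj.shape[0]
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - d) / n + d * (P.T @ r)  # standard PageRank power iteration
    return r

# Toy MLP as a graph: 4 inputs -> 3 hidden -> 2 outputs, edges weighted by |W|.
rng = np.random.default_rng(0)
W1 = np.abs(rng.normal(size=(3, 4)))   # hidden x input
W2 = np.abs(rng.normal(size=(2, 3)))   # output x hidden
adj = np.zeros((9, 9))
adj[0:4, 4:7] = W1.T                   # input -> hidden edges
adj[4:7, 7:9] = W2.T                   # hidden -> output edges
scores = weighted_pagerank(adj)[4:7]   # centrality of the hidden units
print("prune hidden unit:", int(scores.argmin()))
```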
ccs96307/fast-llm-inference
Accelerating LLM inference with techniques like speculative decoding, quantization, and kernel fusion, focusing on implementing state-of-the-art research papers.
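Speculative decoding lets a small draft model propose several tokens that the large target model then verifies in a single forward pass, so the expensive model runs once per accepted batch of tokens. A simplified greedy-verification sketch, assuming both models share a tokenizer and return Hugging Face-style outputs with `.logits`:

```python
import torch

@torch.no_grad()
def speculative_step(target, draft, ids, k=4):
    # Draft k tokens greedily with the small model (no cache, for clarity).
    proposal = ids
    for _ in range(k):
        nxt = draft(proposal).logits[:, -1].argmax(-1, keepdim=True)
        proposal = torch.cat([proposal, nxt], dim=-1)
    # One target forward pass scores every drafted position at once.
    tgt = target(proposal).logits.argmax(-1)  # target's greedy pick after each position
    n = ids.shape[1]
    accepted = 0
    for i in range(k):                        # keep the prefix both models agree on
        if tgt[0, n - 1 + i] != proposal[0, n + i]:
            break
        accepted += 1
    # Accepted draft tokens plus the target's own next token:
    # every target pass yields at least one new token.
    kept = proposal[:, : n + accepted]
    return torch.cat([kept, tgt[:, n - 1 + accepted : n + accepted]], dim=-1)
```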
EZ-Optimium/Optimium
Your AI catalyst: an inference backend that maximizes your model's inference performance
Bisonai/ncnn
Modified inference engine for quantized convolution using product quantization
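Product quantization splits each weight vector into subvectors and replaces each chunk with the nearest entry of a small learned codebook, shrinking storage and enabling table-lookup arithmetic. A toy NumPy sketch with plain k-means (an illustration of PQ itself, not ncnn's implementation):

```python
import numpy as np

def pq_codebooks(W, m=4, k=16, iters=10, seed=0):
    """Split each row of W into m subvectors; learn a k-entry codebook per chunk."""
    rng = np.random.default_rng(seed)
    books, codes = [], []
    for X in np.split(W, m, axis=1):
        C = X[rng.choice(len(X), k, replace=False)]  # init centroids from data
        for _ in range(iters):                       # a few rounds of plain k-means
            assign = ((X[:, None] - C[None]) ** 2).sum(-1).argmin(1)
            for j in range(k):
                if (assign == j).any():
                    C[j] = X[assign == j].mean(0)
        books.append(C)
        codes.append(assign)
    return books, codes

# Each row is approximated by concatenating its chunks' codebook entries.
W = np.random.default_rng(1).normal(size=(64, 32)).astype(np.float32)
books, codes = pq_codebooks(W)
W_hat = np.concatenate([B[c] for B, c in zip(books, codes)], axis=1)
print("reconstruction MSE:", float(((W - W_hat) ** 2).mean()))
```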
effrosyni-papanastasiou/constrained-em
A constrained expectation-maximization algorithm for feasible graph inference.
zhliuworks/Fast-MobileNetV2
🤖️ Optimized CUDA Kernels for Fast MobileNetV2 Inference
amazon-science/mlp-rank-pruning
MLP-Rank: a graph-theoretical approach to structured pruning of deep neural networks based on weighted PageRank centrality, as introduced in the accompanying thesis.
BjornMelin/local-llm-workbench
🧠 A comprehensive toolkit for benchmarking, optimizing, and deploying local Large Language Models. Includes performance testing tools, optimized configurations for CPU/GPU/hybrid setups, and detailed guides to maximize LLM performance on your hardware.
sjlee25/batch-partitioning
Batch Partitioning for Multi-PE Inference with TVM (2020)
yester31/TensorRT_Examples
TensorRT in Practice: Model Conversion, Extension, and Advanced Inference Optimization
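The usual conversion path is ONNX model → parsed network → serialized TensorRT engine. A minimal sketch against the TensorRT 8.x Python API (`model.onnx` is a placeholder file name):

```python
import tensorrt as trt  # TensorRT 8.x Python API

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:     # placeholder ONNX file
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)   # allow mixed-precision kernels
engine = builder.build_serialized_network(network, config)
with open("model.engine", "wb") as f:   # reusable serialized engine
    f.write(engine)
```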
cedrickchee/pytorch-mobile-android
PyTorch Mobile: Android examples of usage in applications
kiritigowda/mivisionx-inference-analyzer
MIVisionX Python Inference Analyzer uses pre-trained ONNX/NNEF/Caffe models to analyze inference results and summarize individual image results
piotrostr/infer-trt
Interface for TensorRT engine inference, along with an example using a YOLOv4 engine.
shreyansh26/Accelerating-Cross-Encoder-Inference
Leveraging torch.compile to accelerate cross-encoder inference
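torch.compile wraps a model in a JIT-compiled graph: the first call pays the compilation cost, and subsequent batches run the optimized kernels. A minimal cross-encoder scoring sketch (the ms-marco-MiniLM checkpoint is one common public cross-encoder, not necessarily the one used in this repo):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "cross-encoder/ms-marco-MiniLM-L-6-v2"  # a common public cross-encoder
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name).eval()

# The first call triggers compilation; later batches reuse the optimized graph.
model = torch.compile(model)

batch = tok(["what is inference optimization?"],
            ["Techniques that speed up model serving."],
            padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    scores = model(**batch).logits.squeeze(-1)
print(scores)
```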
aalbaali/LieBatch
Batch estimation on Lie groups
ankdeshm/inference-optimization
A compilation of various ML and DL models and ways to optimize their inference.
cedrickchee/pytorch-mobile-ios
PyTorch Mobile: iOS examples
Wb-az/YOLOv8-Image-detection
YOLOv8 object detection
booyasatoshi/quantum-annealer
Research into optimizing training and inference for AI models on CPUs using simulated quantum-annealing algorithms
matteo-stat/transformers-nlp-ner-token-classification
This repo provides scripts for fine-tuning Hugging Face Transformers, setting up pipelines, and optimizing token-classification models for inference. They are based on my experience developing a custom chatbot; I'm sharing them in the hope they will help others quickly fine-tune and use models in their projects! 😊
OneAndZero24/TRTTL
TensorRT C++ Template Library