Awesome Tensor Compilers
A list of awesome compiler projects and papers for tensor computation and deep learning.
Contents
Open Source Projects
- TVM: An End to End Machine Learning Compiler Framework
- MLIR: Multi-Level Intermediate Representation
- XLA: Optimizing Compiler for Machine Learning
- Halide: A Language for Fast, Portable Computation on Images and Tensors
- Glow: Compiler for Neural Network Hardware Accelerators
- nnfusion: A Flexible and Efficient Deep Neural Network Compiler
- Hummingbird: Compiling Trained ML Models into Tensor Computation
- Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations
- Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code
- TensorComprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions
- PlaidML: A Platform for Making Deep Learning Work Everywhere
- BladeDISC: An End-to-End DynamIc Shape Compiler for Machine Learning Workloads
- TACO: The Tensor Algebra Compiler
- Nebulgym: Easy-to-use Library to Accelerate AI Training
- Nebullvm: Easy-to-use Library to Boost AI Inference
- NN-512: A Compiler That Generates C99 Code for Neural Net Inference
- DaCeML: A Data-Centric Compiler for Machine Learning
Papers
Survey
- The Deep Learning Compiler: A Comprehensive Survey by Mingzhen Li et al., TPDS 2020
- An In-depth Comparison of Compilers for DeepNeural Networks on Hardware by Yu Xing et al., ICESS 2019
Compiler and IR Design
- TensorIR: An Abstraction for Automatic Tensorized Program Optimization by Siyuan Feng, Bohan Hou et al., arXiv 2022
- Exocompilation for Productive Programming of Hardware Accelerators by Yuka Ikarashi, Gilbert Louis Bernstein et al., PLDI 2022
- DaCeML: A Data-Centric Compiler for Machine Learning by Oliver Rausch et al., ICS 22
- FreeTensor: A Free-Form DSL with Holistic Optimizations for Irregular Tensor Programs by Shizhi Tang et al., PLDI 2022
- Roller: Fast and Efficient Tensor Compilation for Deep Learning by Hongyu Zhu et al., OSDI 2022
- AStitch: Enabling a New Multi-dimensional Optimization Space for Memory-Intensive ML Training and Inference on Modern SIMT Architectures by Zhen Zheng et al., ASPLOS 2022
- Composable and Modular Code Generation in MLIR: A Structured and Retargetable Approach to Tensor Compiler Construction by Nicolas Vasilache et al., arXiv 2022
- PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections by Haojie Wang et al., OSDI 2021
- MLIR: Scaling Compiler Infrastructure for Domain Specific Computation by Chris Lattner et al., CGO 2021
- A Tensor Compiler for Unified Machine Learning Prediction Serving by Supun Nakandala et al., OSDI 2020
- Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks by Lingxiao Ma et al., OSDI 2020
- Stateful Dataflow Multigraphs: A Data-Centric Model for Performance Portability on Heterogeneous Architectures by Tal Ben-Nun et al., SC 2019
- TASO: The Tensor Algebra SuperOptimizer for Deep Learning by Zhihao Jia et al., SOSP 2019
- Tiramisu: A polyhedral compiler for expressing fast and portable code by Riyadh Baghdadi et al., CGO 2019
- Triton: an intermediate language and compiler for tiled neural network computations by Philippe Tillet et al., MAPL 2019
- Relay: A High-Level Compiler for Deep Learning by Jared Roesch et al., arXiv 2019
- TVM: An Automated End-to-End Optimizing Compiler for Deep Learning by Tianqi Chen et al., OSDI 2018
- Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions by Nicolas Vasilache et al., arXiv 2018
- Intel nGraph: An Intermediate Representation, Compiler, and Executor for Deep Learning by Scott Cyphers et al., arXiv 2018
- Glow: Graph Lowering Compiler Techniques for Neural Networks by Nadav Rotem et al., arXiv 2018
- DLVM: A modern compiler infrastructure for deep learning systems by Richard Wei et al., arXiv 2018
- Diesel: DSL for linear algebra and neural net computations on GPUs by Venmugil Elango et al., MAPL 2018
- The Tensor Algebra Compiler by Fredrik Kjolstad et al., OOPSLA 2017
- Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines by Jonathan Ragan-Kelley et al., PLDI 2013
Auto-tuning and Auto-scheduling
- One-shot tuner for deep learning compilers by Jaehun Ryu et al., CC 2022
- Autoscheduling for sparse tensor algebra with an asymptotic cost model by Peter Ahrens et al., PLDI 2022
- Bolt: Bridging the Gap between Auto-tuners and Hardware-native Performance by Jiarong Xing et al., MLSys 2022
- A Full-Stack Search Technique for Domain Optimized Deep Learning Accelerators by Dan Zhang et al., ASPLOS 2022
- Lorien: Efficient Deep Learning Workloads Delivery by Cody Hao Yu et al., SoCC 2021
- Value Learning for Throughput Optimization of Deep Neural Networks by Benoit Steiner et al., MLSys 2021
- A Flexible Approach to Autotuning Multi-Pass Machine Learning Compilers by Phitchaya Mangpo Phothilimthana et al., PACT 2021
- Ansor: Generating High-Performance Tensor Programs for Deep Learning by Lianmin Zheng et al., OSDI 2020
- Schedule Synthesis for Halide Pipelines on GPUs by Sioutas Savvas et al., TACO 2020
- FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System by Size Zheng et al., ASPLOS 2020
- ProTuner: Tuning Programs with Monte Carlo Tree Search by Ameer Haj-Ali et al., arXiv 2020
- AdaTune: Adaptive tensor program compilation made efficient by Menghao Li et al., NeurIPS 2020
- Optimizing the Memory Hierarchy by Compositing Automatic Transformations on Computations and Data by Jie Zhao et al., MICRO 2020
- Chameleon: Adaptive Code Optimization for Expedited Deep Neural Network Compilation by Byung Hoon Ahn et al., ICLR 2020
- A Sparse Iteration Space Transformation Framework for Sparse Tensor Algebra by Ryan Senanayake et al. OOPSLA 2020
- Learning to Optimize Halide with Tree Search and Random Programs by Andrew Adams et al., SIGGRAPH 2019
- Learning to Optimize Tensor Programs by Tianqi Chen et al., NeurIPS 2018
- Automatically Scheduling Halide Image Processing Pipelines by Ravi Teja Mullapudi et al., SIGGRAPH 2016
Cost Model
- An Asymptotic Cost Model for Autoscheduling Sparse Tensor Programs by Peter Ahrens et al., PLDI 2022
- TenSet: A Large-scale Program Performance Dataset for Learned Tensor Compilers by Lianmin Zheng., NeurIPS 2021
- A Deep Learning Based Cost Model for Automatic Code Optimization by Riyadh Baghdadi et al., MLSys 2021
- A Learned Performance Model for the Tensor Processing Unit by Samuel J. Kaufman et al., MLSys 2021
- DYNATUNE: Dynamic Tensor Program Optimization in Deep Neural Network Compilation by Minjia Zhang et al., ICLR 2021
- MetaTune: Meta-Learning Based Cost Model for Fast and Efficient Auto-tuning Frameworks by Jaehun Ryu et al., arxiv 2021
CPU and GPU Optimization
- DeepCuts: A deep learning optimization framework for versatile GPU workloads by Wookeun Jung et al., PLDI 2021
- Analytical characterization and design space exploration for optimization of CNNs by Rui Li et al., ASPLOS 2021
- UNIT: Unifying Tensorized Instruction Compilation by Jian Weng et al., CGO 2021
- PolyDL: Polyhedral Optimizations for Creation of HighPerformance DL primitives by Sanket Tavarageri et al., arXiv 2020
- Fireiron: A Data-Movement-Aware Scheduling Language for GPUs by Bastian Hagedorn et al., PACT 2020
- Automatic Kernel Generation for Volta Tensor Cores by Somashekaracharya G. Bhaskaracharya et al., arXiv 2020
- Swizzle Inventor: Data Movement Synthesis for GPU Kernels by Phitchaya Mangpo Phothilimthana et al., ASPLOS 2019
- Optimizing CNN Model Inference on CPUs by Yizhi Liu et al., ATC 2019
- Analytical cache modeling and tilesize optimization for tensor contractions by Rui Li et al., SC 19
NPU Optimization
- AMOS: Enabling Automatic Mapping for Tensor Computations On Spatial Accelerators with Hardware Abstraction by Size Zheng et al., ISCA 2022
- Towards the Co-design of Neural Networks and Accelerators by Yanqi Zhou et al., MLSys 2022
- AKG: Automatic Kernel Generation for Neural Processing Units using Polyhedral Transformations by Jie Zhao et al., PLDI 2021
Graph-level Optimization
- Apollo: Automatic Partition-based Operator Fusion through Layer by Layer Optimization by Jie Zhao et al., MLSys 2022
- Equality Saturation for Tensor Graph Superoptimization by Yichen Yang et al., MLSys 2021
- IOS: An Inter-Operator Scheduler for CNN Acceleration by Yaoyao Ding et al., MLSys 2021
- Optimizing DNN Computation Graph using Graph Substitutions by Jingzhi Fang et al., VLDB 2020
- Transferable Graph Optimizers for ML Compilers by Yanqi Zhou et al., NeurIPS 2020
- FusionStitching: Boosting Memory IntensiveComputations for Deep Learning Workloads by Zhen Zheng et al., arXiv 2020
- Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning by Woosuk Kwon et al., Neurips 2020
Dynamic Model
- DietCode: Automatic Optimization for Dynamic Tensor Programs by Bojian Zheng et al., MLSys 2022
- The CoRa Tensor Compiler: Compilation for Ragged Tensors with Minimal Padding by Pratik Fegade et al., MLSys 2022
- Nimble: Efficiently Compiling Dynamic Neural Networks for Model Inference by Haichen Shen et al., MLSys 2021
- DISC: A Dynamic Shape Compiler for Machine Learning Workloads by Kai Zhu et al., EuroMLSys 2021
- Cortex: A Compiler for Recursive Deep Learning Models by Pratik Fegade et al., MLSys 2021
Graph Neural Networks
- Graphiler: Optimizing Graph Neural Networks with Message Passing Data Flow Graph by Zhiqiang Xie et al., MLSys 2022
- Seastar: vertex-centric programming for graph neural networks by Yidi Wu et al., Eurosys 2021
- FeatGraph: A Flexible and Efficient Backend for Graph Neural Network Systems by Yuwei Hu et al., SC 2020
Distributed Computing
- SpDISTAL: Compiling Distributed Sparse Tensor Computations by Rohan Yadav et al., SC 2022
- Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning by Lianmin Zheng, Zhuohan Li, Hao Zhang et al., OSDI 2022
- Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parallelization by Colin Unger, Zhihao Jia, et al., OSDI 2022
- Synthesizing Optimal Parallelism Placement and Reduction Strategies on Hierarchical Systems for Deep Learning by Ningning Xie, Tamara Norman, Diminik Grewe, Dimitrios Vytiniotis et al., MLSys 2022
- DISTAL: The Distributed Tensor Algebra Compiler by Rohan Yadav et al., PLDI 2022
- GSPMD: General and Scalable Parallelization for ML Computation Graphs by Yuanzhong Xu et al., arXiv 2021
- Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads by Abhinav Jangda et al., ASPLOS 2022
- OneFlow: Redesign the Distributed Deep Learning Framework from Scratch by Jinhui Yuan et al., arXiv 2021
- Beyond Data and Model Parallelism for Deep Neural Networks by Zhihao et al., MLSys 2019
- Supporting Very Large Models using Automatic Dataflow Graph Partitioning by Minjie Wang et al., EuroSys 2019
- Distributed Halide by Tyler Denniston et al., PPoPP 2016
Quantization and Sparsification
- SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning by Zihao Ye et al., arXiv 2022
- SparseLNR: Accelerating Sparse Tensor Computations Using Loop Nest Restructuring by Adhitha Dias et al., ICS 2022
- SparTA: Deep-Learning Model Sparsity via Tensor-with-Sparsity-Attribute by Ningxin Zheng et al., OSDI 2022
- Compiler Support for Sparse Tensor Computations in MLIR by Aart J.C. Bik et al., arXiv 2022
- Automatic Generation of High-Performance Quantized Machine Learning Kernels by Meghan Cowan et al., CGO 2020
Program Rewriting
- Verified tensor-program optimization via high-level scheduling rewrites by Amanda Liu et al., POPL 2022
- Pure Tensor Program Rewriting via Access Patterns (Representation Pearl) by Gus Smith et al., MAPL 2021
- Equality Saturation for Tensor Graph Superoptimization by Yichen Yang et al., MLSys 2021
Verification and Testing
- Coverage-guided tensor compiler fuzzing with joint IR-pass mutation by Jiawei Liu et al., OOPSLA 2022
- End-to-End Translation Validation for the Halide Language by Basile Clément et al., OOPSLA 2022
- A comprehensive study of deep learning compiler bugs by Qingchao Shen et al., ESEC/FSE 2021
- Verifying and Improving Halide’s Term Rewriting System with Program Synthesis by Julie L. Newcomb et al., OOPSLA 2020
Tutorials
Contribute
We encourage all contributions to this repository. Open an issue or send a pull request.
Notes on the Link Format
We prefer using a link which points to a more informative page instead of a single pdf. For example, for arxiv papers, we prefer https://arxiv.org/abs/1802.04799 over https://arxiv.org/pdf/1802.04799.pdf. For OSDI papers, we prefer https://www.usenix.org/conference/osdi18/presentation/chen over https://www.usenix.org/system/files/osdi18-chen.pdf