A list of awesome compiler projects and papers for tensor computation and deep learning.
- TVM: An End-to-End Machine Learning Compiler Framework
- MLIR: Multi-Level Intermediate Representation
- XLA: Optimizing Compiler for Machine Learning
- Halide: A Language for Fast, Portable Computation on Images and Tensors
- Glow: Compiler for Neural Network Hardware Accelerators
- NNFusion: A Flexible and Efficient Deep Neural Network Compiler
- Hummingbird: Compiling Trained ML Models into Tensor Computation
- Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations
- AITemplate: A Python framework that renders neural networks into high-performance CUDA/HIP C++ code
- Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code
- Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions
- PlaidML: A Platform for Making Deep Learning Work Everywhere
- BladeDISC: An End-to-End DynamIc Shape Compiler for Machine Learning Workloads
- TACO: The Tensor Algebra Compiler
- Nebulgym: Easy-to-use Library to Accelerate AI Training
- Speedster: Automatically apply SOTA optimization techniques to achieve the maximum inference speed-up on your hardware
- NN-512: A Compiler That Generates C99 Code for Neural Net Inference
- DaCeML: A Data-Centric Compiler for Machine Learning
- The Deep Learning Compiler: A Comprehensive Survey by Mingzhen Li et al., TPDS 2020
- An In-depth Comparison of Compilers for Deep Neural Networks on Hardware by Yu Xing et al., ICESS 2019
- TensorIR: An Abstraction for Automatic Tensorized Program Optimization by Siyuan Feng, Bohan Hou et al., ASPLOS 2023
- Exocompilation for Productive Programming of Hardware Accelerators by Yuka Ikarashi, Gilbert Louis Bernstein et al., PLDI 2022
- DaCeML: A Data-Centric Compiler for Machine Learning by Oliver Rausch et al., ICS 2022
- FreeTensor: A Free-Form DSL with Holistic Optimizations for Irregular Tensor Programs by Shizhi Tang et al., PLDI 2022
- Roller: Fast and Efficient Tensor Compilation for Deep Learning by Hongyu Zhu et al., OSDI 2022
- AStitch: Enabling a New Multi-dimensional Optimization Space for Memory-Intensive ML Training and Inference on Modern SIMT Architectures by Zhen Zheng et al., ASPLOS 2022
- Composable and Modular Code Generation in MLIR: A Structured and Retargetable Approach to Tensor Compiler Construction by Nicolas Vasilache et al., arXiv 2022
- PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections by Haojie Wang et al., OSDI 2021
- MLIR: Scaling Compiler Infrastructure for Domain Specific Computation by Chris Lattner et al., CGO 2021
- A Tensor Compiler for Unified Machine Learning Prediction Serving by Supun Nakandala et al., OSDI 2020
- Rammer: Enabling Holistic Deep Learning Compiler Optimizations with rTasks by Lingxiao Ma et al., OSDI 2020
- Stateful Dataflow Multigraphs: A Data-Centric Model for Performance Portability on Heterogeneous Architectures by Tal Ben-Nun et al., SC 2019
- TASO: The Tensor Algebra SuperOptimizer for Deep Learning by Zhihao Jia et al., SOSP 2019
- Tiramisu: A polyhedral compiler for expressing fast and portable code by Riyadh Baghdadi et al., CGO 2019
- Triton: an intermediate language and compiler for tiled neural network computations by Philippe Tillet et al., MAPL 2019
- Relay: A High-Level Compiler for Deep Learning by Jared Roesch et al., arXiv 2019
- TVM: An Automated End-to-End Optimizing Compiler for Deep Learning by Tianqi Chen et al., OSDI 2018
- Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions by Nicolas Vasilache et al., arXiv 2018
- Intel nGraph: An Intermediate Representation, Compiler, and Executor for Deep Learning by Scott Cyphers et al., arXiv 2018
- Glow: Graph Lowering Compiler Techniques for Neural Networks by Nadav Rotem et al., arXiv 2018
- DLVM: A modern compiler infrastructure for deep learning systems by Richard Wei et al., arXiv 2018
- Diesel: DSL for linear algebra and neural net computations on GPUs by Venmugil Elango et al., MAPL 2018
- The Tensor Algebra Compiler by Fredrik Kjolstad et al., OOPSLA 2017
- Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation in Image Processing Pipelines by Jonathan Ragan-Kelley et al., PLDI 2013
- One-shot tuner for deep learning compilers by Jaehun Ryu et al., CC 2022
- Autoscheduling for sparse tensor algebra with an asymptotic cost model by Peter Ahrens et al., PLDI 2022
- Bolt: Bridging the Gap between Auto-tuners and Hardware-native Performance by Jiarong Xing et al., MLSys 2022
- A Full-Stack Search Technique for Domain Optimized Deep Learning Accelerators by Dan Zhang et al., ASPLOS 2022
- Lorien: Efficient Deep Learning Workloads Delivery by Cody Hao Yu et al., SoCC 2021
- Value Learning for Throughput Optimization of Deep Neural Networks by Benoit Steiner et al., MLSys 2021
- A Flexible Approach to Autotuning Multi-Pass Machine Learning Compilers by Phitchaya Mangpo Phothilimthana et al., PACT 2021
- Ansor: Generating High-Performance Tensor Programs for Deep Learning by Lianmin Zheng et al., OSDI 2020
- Schedule Synthesis for Halide Pipelines on GPUs by Savvas Sioutas et al., TACO 2020
- FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System by Size Zheng et al., ASPLOS 2020
- ProTuner: Tuning Programs with Monte Carlo Tree Search by Ameer Haj-Ali et al., arXiv 2020
- AdaTune: Adaptive tensor program compilation made efficient by Menghao Li et al., NeurIPS 2020
- Optimizing the Memory Hierarchy by Compositing Automatic Transformations on Computations and Data by Jie Zhao et al., MICRO 2020
- Chameleon: Adaptive Code Optimization for Expedited Deep Neural Network Compilation by Byung Hoon Ahn et al., ICLR 2020
- A Sparse Iteration Space Transformation Framework for Sparse Tensor Algebra by Ryan Senanayake et al., OOPSLA 2020
- Learning to Optimize Halide with Tree Search and Random Programs by Andrew Adams et al., SIGGRAPH 2019
- Learning to Optimize Tensor Programs by Tianqi Chen et al., NeurIPS 2018
- Automatically Scheduling Halide Image Processing Pipelines by Ravi Teja Mullapudi et al., SIGGRAPH 2016
- TLP: A Deep Learning-based Cost Model for Tensor Program Tuning by Yi Zhai et al., ASPLOS 2023
- An Asymptotic Cost Model for Autoscheduling Sparse Tensor Programs by Peter Ahrens et al., PLDI 2022
- TenSet: A Large-scale Program Performance Dataset for Learned Tensor Compilers by Lianmin Zheng et al., NeurIPS 2021
- A Deep Learning Based Cost Model for Automatic Code Optimization by Riyadh Baghdadi et al., MLSys 2021
- A Learned Performance Model for the Tensor Processing Unit by Samuel J. Kaufman et al., MLSys 2021
- DYNATUNE: Dynamic Tensor Program Optimization in Deep Neural Network Compilation by Minjia Zhang et al., ICLR 2021
- MetaTune: Meta-Learning Based Cost Model for Fast and Efficient Auto-tuning Frameworks by Jaehun Ryu et al., arXiv 2021
- DeepCuts: A deep learning optimization framework for versatile GPU workloads by Wookeun Jung et al., PLDI 2021
- Analytical characterization and design space exploration for optimization of CNNs by Rui Li et al., ASPLOS 2021
- UNIT: Unifying Tensorized Instruction Compilation by Jian Weng et al., CGO 2021
- PolyDL: Polyhedral Optimizations for Creation of High-Performance DL Primitives by Sanket Tavarageri et al., arXiv 2020
- Fireiron: A Data-Movement-Aware Scheduling Language for GPUs by Bastian Hagedorn et al., PACT 2020
- Automatic Kernel Generation for Volta Tensor Cores by Somashekaracharya G. Bhaskaracharya et al., arXiv 2020
- Swizzle Inventor: Data Movement Synthesis for GPU Kernels by Phitchaya Mangpo Phothilimthana et al., ASPLOS 2019
- Optimizing CNN Model Inference on CPUs by Yizhi Liu et al., ATC 2019
- Analytical cache modeling and tilesize optimization for tensor contractions by Rui Li et al., SC 2019
- AMOS: Enabling Automatic Mapping for Tensor Computations On Spatial Accelerators with Hardware Abstraction by Size Zheng et al., ISCA 2022
- Towards the Co-design of Neural Networks and Accelerators by Yanqi Zhou et al., MLSys 2022
- AKG: Automatic Kernel Generation for Neural Processing Units using Polyhedral Transformations by Jie Zhao et al., PLDI 2021
- Apollo: Automatic Partition-based Operator Fusion through Layer by Layer Optimization by Jie Zhao et al., MLSys 2022
- Equality Saturation for Tensor Graph Superoptimization by Yichen Yang et al., MLSys 2021
- IOS: An Inter-Operator Scheduler for CNN Acceleration by Yaoyao Ding et al., MLSys 2021
- Optimizing DNN Computation Graph using Graph Substitutions by Jingzhi Fang et al., VLDB 2020
- Transferable Graph Optimizers for ML Compilers by Yanqi Zhou et al., NeurIPS 2020
- FusionStitching: Boosting Memory Intensive Computations for Deep Learning Workloads by Zhen Zheng et al., arXiv 2020
- Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning by Woosuk Kwon et al., NeurIPS 2020
- DietCode: Automatic Optimization for Dynamic Tensor Programs by Bojian Zheng et al., MLSys 2022
- The CoRa Tensor Compiler: Compilation for Ragged Tensors with Minimal Padding by Pratik Fegade et al., MLSys 2022
- Nimble: Efficiently Compiling Dynamic Neural Networks for Model Inference by Haichen Shen et al., MLSys 2021
- DISC: A Dynamic Shape Compiler for Machine Learning Workloads by Kai Zhu et al., EuroMLSys 2021
- Cortex: A Compiler for Recursive Deep Learning Models by Pratik Fegade et al., MLSys 2021
- Graphiler: Optimizing Graph Neural Networks with Message Passing Data Flow Graph by Zhiqiang Xie et al., MLSys 2022
- Seastar: vertex-centric programming for graph neural networks by Yidi Wu et al., EuroSys 2021
- FeatGraph: A Flexible and Efficient Backend for Graph Neural Network Systems by Yuwei Hu et al., SC 2020
- SpDISTAL: Compiling Distributed Sparse Tensor Computations by Rohan Yadav et al., SC 2022
- Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning by Lianmin Zheng, Zhuohan Li, Hao Zhang et al., OSDI 2022
- Unity: Accelerating DNN Training Through Joint Optimization of Algebraic Transformations and Parallelization by Colin Unger, Zhihao Jia, et al., OSDI 2022
- Synthesizing Optimal Parallelism Placement and Reduction Strategies on Hierarchical Systems for Deep Learning by Ningning Xie, Tamara Norman, Dominik Grewe, Dimitrios Vytiniotis et al., MLSys 2022
- DISTAL: The Distributed Tensor Algebra Compiler by Rohan Yadav et al., PLDI 2022
- GSPMD: General and Scalable Parallelization for ML Computation Graphs by Yuanzhong Xu et al., arXiv 2021
- Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads by Abhinav Jangda et al., ASPLOS 2022
- OneFlow: Redesign the Distributed Deep Learning Framework from Scratch by Jinhui Yuan et al., arXiv 2021
- Beyond Data and Model Parallelism for Deep Neural Networks by Zhihao Jia et al., MLSys 2019
- Supporting Very Large Models using Automatic Dataflow Graph Partitioning by Minjie Wang et al., EuroSys 2019
- Distributed Halide by Tyler Denniston et al., PPoPP 2016
- SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning by Zihao Ye et al., arXiv 2022
- SparseLNR: Accelerating Sparse Tensor Computations Using Loop Nest Restructuring by Adhitha Dias et al., ICS 2022
- SparTA: Deep-Learning Model Sparsity via Tensor-with-Sparsity-Attribute by Ningxin Zheng et al., OSDI 2022
- Compiler Support for Sparse Tensor Computations in MLIR by Aart J.C. Bik et al., arXiv 2022
- Automatic Generation of High-Performance Quantized Machine Learning Kernels by Meghan Cowan et al., CGO 2020
- Verified tensor-program optimization via high-level scheduling rewrites by Amanda Liu et al., POPL 2022
- Pure Tensor Program Rewriting via Access Patterns (Representation Pearl) by Gus Smith et al., MAPL 2021
- Coverage-guided tensor compiler fuzzing with joint IR-pass mutation by Jiawei Liu et al., OOPSLA 2022
- End-to-End Translation Validation for the Halide Language by Basile Clément et al., OOPSLA 2022
- A comprehensive study of deep learning compiler bugs by Qingchao Shen et al., ESEC/FSE 2021
- Verifying and Improving Halide’s Term Rewriting System with Program Synthesis by Julie L. Newcomb et al., OOPSLA 2020
We encourage all contributions to this repository. Open an issue or send a pull request.
We prefer links that point to a more informative page rather than a single PDF. For example, for arXiv papers we prefer https://arxiv.org/abs/1802.04799 over https://arxiv.org/pdf/1802.04799.pdf, and for OSDI papers we prefer https://www.usenix.org/conference/osdi18/presentation/chen over https://www.usenix.org/system/files/osdi18-chen.pdf.
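For instance, an entry for the TVM paper listed above could use the preferred presentation page like this (the Markdown link style shown here is only a suggested sketch):

```markdown
- [TVM: An Automated End-to-End Optimizing Compiler for Deep Learning](https://www.usenix.org/conference/osdi18/presentation/chen) by Tianqi Chen et al., OSDI 2018
```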