ReArch Group Paper Reading List


Seminars

Spring 2021

| Date | Paper Title | Presenter | Notes |
| --- | --- | --- | --- |
| 03.01 | Training for Multi-resolution Inference Using Reusable Quantization Terms | Cong Guo | |
| 03.08 | Toward Efficient Interaction between Python and Native Libraries | Yuxian Qiu | |
| 03.15 | SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning | Yue Guan | |
| 03.22 | X-Stream: Edge-centric Graph Processing using Streaming Partitions | Zhihui Zhang | |
| 03.29 | Loop Nest Optimization, the Polyhedral Model, and the MICRO 2020 Best Paper (Optimizing the Memory Hierarchy by Compositing Automatic Transformations on Computations and Data) | Zihan Liu | Slides |
| 04.12 | Defensive Approximation: Securing CNNs using Approximate Computing | Yakai Wang | Related Work |
| 05.17 | Commutative Data Reordering: A New Technique to Reduce Data Movement Energy on Sparse Inference Workloads | Yangjie Zhou | ISCA 2020 |
| 05.31 | Large Graph Convolutional Network Training with GPU-Oriented Data Communication Architecture | Zhihui Zhang | VLDB 2021 |
| 06.07 | DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification | Yue Guan | NeurIPS 2021 |

Summer 2021

| Date | Paper Title | Presenter | Notes |
| --- | --- | --- | --- |
| 07.14 | AKG: Automatic Kernel Generation for Neural Processing Units Using Polyhedral Transformations (PLDI 2021) | Yuxian Qiu | Slides |
| 07.21 | Floating-Point Format and Quantization for Deep Learning Computation | Cong Guo | |
| 07.28 | P-OPT: Practical Optimal Cache Replacement for Graph Analytics | Yangjie Zhou | Slides |
| 08.04 | Rubik: A Hierarchical Architecture for Efficient Graph Neural Network Training | Zhihui Zhang | |
| 08.11 | A Useful Tool, CKA: Similarity of Neural Network Representations Revisited, and Its Application: Uncovering How Neural Network Representations Vary with Width and Depth | Zhengyi Li | Slides |
| 08.18 | Ansor: Generating High-Performance Tensor Programs for Deep Learning | Zihan Liu | Slides |

Fall 2021

| Date | Paper Title | Presenter | Notes |
| --- | --- | --- | --- |
| 10.11 | Adaptive Numeric Type for DNN Quantization | Cong Guo | |
| 10.18 | Compiling Graph Applications for GPUs with GraphIt | Yangjie Zhou | Slides |
| 11.01 | TENET: A Framework for Modeling Tensor Dataflow Based on Relation-centric Notation | Zihan Liu | Slides |
| 11.08 | Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity | Zhengyi Li | Slides (code: zdea) |
| 11.22 | Dynamic Tensor Rematerialization; Checkmate: Breaking the Memory Wall with Optimal Tensor Rematerialization | Yue Guan | Slides (one deck per paper) |
| 11.29 | GraphPulse: An Event-Driven Hardware Accelerator for Asynchronous Graph Processing | Zhihui Zhang | Presentation |
| 12.06 | CheckFreq: Frequent, Fine-Grained DNN Checkpointing | Guandong Lu | Slides |
| 12.13 | PipeDream: Generalized Pipeline Parallelism for DNN Training | Runzhe Chen | Slides |
| 12.20 | Towards Scalable Distributed Training of Deep Learning on Public Cloud Clusters | Yakai Wang | Slides |

Spring 2022

| Date | Paper Title | Presenter | Notes |
| --- | --- | --- | --- |
| 03.10 | Speculative Execution Attacks: Meltdown, Spectre, and Pinned Loads | Zihan Liu | Slides |
| 03.24 | SparTA: Deep-Learning Model Sparsity via Tensor-with-Sparsity-Attribute | Yue Guan | |
| 03.31 | ROLLER: Fast and Efficient Tensor Compilation for Deep Learning | Yijia Diao | Link |
| 04.07 | Adaptable Register File Organization for Vector Processors | Zhihui Zhang | |
| 04.14 | Cortex: A Compiler for Recursive Deep Learning Models | Yangjie Zhou | Slides |
| 04.21 | Zero-Knowledge Succinct Non-Interactive Argument of Knowledge | Shuwen Lu | Slides |
| 05.05 | Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning | Runzhe Chen | Slides |

Fall 2022

| Date | Paper Title | Presenter | Notes |
| --- | --- | --- | --- |
| 09.20 | ANT: Exploiting Adaptive Numerical Data Type for Low-bit Deep Neural Network Quantization | Cong Guo | Slides |
| 09.27 | X-Cache: A Modular Architecture for Domain-Specific Caches | Zihan Liu | Slides |
| 10.18 | Automatically Discovering ML Optimizations | Yangjie Zhou | Slides |
| 11.08 | Privacy-Preserving Machine Learning: Inference | Zhengyi Li | Slides |
| 11.15 | Dynamic Tensor Compilers | Yijia Diao | Slides |

Spring 2023

| Date | Paper Title | Presenter | Notes |
| --- | --- | --- | --- |
| 03.30 | JUNO: Algorithm-Hardware Mapping Co-design for Efficient Approximate Nearest Neighbour Search in High Dimensional Space | Zihan Liu | |
| 04.06 | LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale; SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models; Optimal Brain Compression: A Framework for Accurate Post-Training Quantization and Pruning; GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers; SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot; P-Tuning v2: Prompt Tuning Can Be Comparable to Fine-tuning Universally Across Scales and Tasks; Offsite-Tuning: Transfer Learning without Full Model; LoRA: Low-Rank Adaptation of Large Language Models | Jiaming Tang | Slides |
| 04.13 | SMG: Towards Efficient Execution and Adequate Encryption of Private DNN Inference via Secure Micro-Graph | Zhengyi Li | Slides |
| 05.04 | FlexGen and FlashAttention | Yue Guan | Slides |
| 05.11 | Multi-Tenant DNN Inference: Spatial GPU Sharing | Yijia Diao | Slides |
| 05.25 | Chimera: An Analytical Optimizing Framework for Effective Compute-intensive Operators Fusion | Yangjie Zhou | TVMConf Video |

Fall 2023

| Date | Paper Title | Presenter | Notes |
| --- | --- | --- | --- |
| 09.21 | GPU Warp Scheduling and Control Code | Weiming Hu | Slides |
| 09.28 | Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity | Yue Guan | Slides |
| 10.12 | Shared SIMD Unit: Occamy; Two Out-of-Order-Commit CPUs: NOREBA and Orinoco | Zihan Liu | Slides |
| 10.19 | Multitasking on GPU: Preemption | Yijia Diao | Slides |
| 10.26 | SecretFlow-SPU: A Performant and User-Friendly Framework for Privacy-Preserving Machine Learning | Zhengyi Li | Slides |
| 11.09 | Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM; ZeRO: Memory Optimizations Toward Training Trillion Parameter Models; ZeRO-Offload: Democratizing Billion-Scale Model Training; ZeRO-Infinity: Breaking the GPU Memory Wall for Extreme Scale Deep Learning | Jiale Xu | Slides |
| 11.16 | Atom: Low-bit Quantization for Efficient and Accurate LLM Serving | Haoyan Zhang | Slides |
| 12.07 | WaveScalar; Think Fast: A Tensor Streaming Processor (TSP) for Accelerating Deep Learning Workloads | Gonglin Xu | Slides |
| 12.14 | Fast Inference from Transformers via Speculative Decoding; SpecInfer: Accelerating Generative Large Language Model Serving with Speculative Inference and Token Tree Verification; LLMCad: Fast and Scalable On-device Large Language Model Inference | Changming Yu | Slides |
| 12.28 | A Framework for Fine-Grained Synchronization of Dependent GPU Kernels; Fast Fine-Grained Global Synchronization on GPUs; AutoScratch: ML-Optimized Cache Management for Inference-Oriented GPUs | Ziyu Huang | Slides |

Spring 2024

| Date | Paper Title | Presenter | Notes |
| --- | --- | --- | --- |
| 03.21 | Transparent GPU Sharing in Container Clouds for Deep Learning Workloads | Yijia Diao | Link |
| 03.28 | DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving | Shuwen Lu | Slides |
| 05.09 | 8-bit Transformer Inference and Fine-tuning for Edge Accelerators | Weiming Hu | Slides |

DNN Architecture

Link


Deep Learning Compiler

List Contributed by Zihan Liu


Past Architecture Papers

List Contributed by Jingwen Leng


MoE Related Papers

List Contributed by Shuwen Lu

Reading Lists from Other Groups