
Awesome-Paper

The trace of paper reading about DSA, GPU, LLM, and AI systems. Papers marked "***" should be read carefully; the others can just be skimmed.

DSA (Domain-Specific Accelerator)

*** Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks link

A very classic CNN accelerator; the overall architecture is shown in the figure below.

[Figure: Eyeriss v1 overall architecture]
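
For intuition, here is a minimal NumPy sketch of the row-stationary dataflow Eyeriss is known for: the 2-D convolution is decomposed into 1-D row primitives, where each "PE" keeps one filter row stationary and slides one input row across it, and partial-sum rows are accumulated to form each output row. This is a high-level abstraction, not the chip's actual PE-array mapping.

```python
import numpy as np

def conv2d_row_stationary(ifmap, weights):
    """2-D convolution decomposed into 1-D row primitives, mirroring
    (at a high level) Eyeriss's row-stationary dataflow: each "PE"
    keeps one filter row stationary, slides one input row across it,
    and the resulting partial-sum rows are accumulated per output row."""
    H, W = ifmap.shape
    R, S = weights.shape
    out = np.zeros((H - R + 1, W - S + 1))
    for oy in range(H - R + 1):          # one output row at a time
        for r in range(R):               # one "PE" per filter row
            frow, irow = weights[r], ifmap[oy + r]
            # 1-D convolution primitive: filter row x input row
            psum = np.array([frow @ irow[x:x + S] for x in range(W - S + 1)])
            out[oy] += psum              # vertical partial-sum accumulation
    return out

# sanity check against a direct 2-D convolution
x, w = np.random.randn(6, 6), np.random.randn(3, 3)
ref = np.array([[(x[i:i+3, j:j+3] * w).sum() for j in range(4)] for i in range(4)])
assert np.allclose(conv2d_row_stationary(x, w), ref)
```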

*** OliVe: Accelerating Large Language Models via Hardware-friendly Outlier-Victim Pair Quantization link

A hardware-software co-design approach to quantization that addresses the outlier problem in Transformers (quantization-accelerator designs of this kind can be grouped into one class of problem). See LLM quantization and architecture co-design.md under DSA (link), which surveys hardware-software co-design methods for post-training quantization of large models.
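
As a toy illustration of the outlier-victim pair idea (the detection rule, threshold, and encoding below are assumptions for the sketch, not the paper's exact hardware encoding): normal values get uniform low-bit quantization, while each outlier keeps near-full precision and its adjacent "victim" is pruned to zero, so the pair still fits the same memory-aligned per-element budget.

```python
import numpy as np

def olive_style_quantize(x, n_bits=4, outlier_thresh=3.0):
    """Toy outlier-victim pair quantization: values whose magnitude
    exceeds outlier_thresh * mean(|x|) are treated as outliers and
    kept at full precision; the neighboring "victim" is pruned to
    zero (its encoding slot is conceptually donated to the outlier).
    All other values get uniform symmetric n-bit quantization."""
    x = np.asarray(x, dtype=np.float32).copy()
    is_outlier = np.abs(x) > outlier_thresh * np.mean(np.abs(x))

    # uniform symmetric quantization for the normal values
    qmax = 2 ** (n_bits - 1) - 1
    normals = x[~is_outlier]
    scale = np.abs(normals).max() / qmax if normals.size else 1.0
    q = np.clip(np.round(x / scale), -qmax, qmax) * scale

    for i in np.flatnonzero(is_outlier):
        q[i] = x[i]                        # outlier kept at high precision
        victim = i + 1 if (i + 1 < len(x) and not is_outlier[i + 1]) else i - 1
        if 0 <= victim < len(x):
            q[victim] = 0.0                # victim pruned to make room
    return q

print(olive_style_quantize([0.1, -0.4, 8.0, 0.2, 0.05, -0.3]))
```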

GPU

LLM

[survey]

*** Full Stack Optimization of Transformer Inference: a Survey link

A survey of inference acceleration across the full stack for large Transformer models.

[hardware]

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU

A scheme that reduces the resource requirements of LLM inference, enabling inference of fairly large models on a single 16 GB GPU (it searches over strategies for computation scheduling, tensor placement, and computation delegation); the basic offloading pattern is sketched below.

Code repository: link
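
A minimal PyTorch sketch of the offloading mechanic such schemes build on: weights live on the CPU, and each layer is copied to the GPU just in time, run, and evicted. FlexGen's actual contribution is a cost-model-driven search over such schedules plus block-wise batching; this sketch (with made-up layer sizes) only shows the basic placement pattern.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
layers = [nn.Linear(4096, 4096) for _ in range(8)]  # weights stay on CPU

def offloaded_forward(x):
    x = x.to(device)
    for layer in layers:
        layer.to(device)        # load this layer's weights just in time
        x = layer(x)
        layer.to("cpu")         # evict so the next layer's weights fit
    return x

print(offloaded_forward(torch.randn(2, 4096)).shape)
```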

Rethink Memory And Communication Costs for Efficient Large Language Model Training link

Proposes PaRO, a new method for optimizing memory consumption and communication cost during large-model training, improving training throughput by 1.19x to 2.50x.
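
As a back-of-the-envelope illustration of the design space such methods explore (the formula uses the standard ~16 bytes/parameter breakdown for mixed-precision Adam; the group_size knob is a stand-in for partial sharding, not PaRO's actual partitioning strategy): fully replicating training states minimizes communication but maximizes memory, fully sharding across all GPUs does the opposite, and group-wise partial sharding sits in between.

```python
# Rough per-GPU memory for mixed-precision Adam training:
# 2 B fp16 weights + 2 B fp16 grads + 12 B fp32 optimizer states
# per parameter. group_size controls how widely the fp16
# weights/grads are sharded (1 = fully replicated, n_gpus = fully
# sharded); optimizer states are sharded across all GPUs here.
def per_gpu_gib(n_params, n_gpus, group_size):
    weights_grads = 4 * n_params / group_size
    optimizer_states = 12 * n_params / n_gpus
    return (weights_grads + optimizer_states) / 2**30

P, N = 7e9, 64                      # e.g. a 7B-parameter model on 64 GPUs
for g in (1, 8, 64):
    print(f"group_size={g:2d}: {per_gpu_gib(P, N, g):6.1f} GiB/GPU")
```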

Efficient LLM Inference on CPUs link

This paper from Intel describes how to deploy large models on CPUs: it supports quantization down to INT4 and implements an efficient open-source library, although the reported speedups do not look especially high. The main contributions are listed below (a toy sketch of the INT4 scheme follows the list):

(1) We propose an automatic INT4 quantization flow and generate high-quality INT4 models with negligible accuracy loss (<1% from the FP32 baseline).
(2) We design a tensor library that supports general CPU instruction sets as well as the latest instruction sets for deep learning acceleration. On top of this tensor library, we develop an efficient LLM runtime to accelerate inference.
(3) We apply our inference solution to popular LLMs ranging from 3B to 20B parameters and demonstrate promising per-token generation latency of 20 ms to 80 ms, much faster than the average human reading speed of about 200 ms per token.

Code repository: link
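
A minimal NumPy sketch of group-wise symmetric INT4 weight quantization, the kind of scheme contribution (1) describes (the group size and symmetric rounding are assumptions for illustration; a real kernel would additionally pack two 4-bit values per byte and may support other schemes):

```python
import numpy as np

def int4_groupwise_quantize(w, group_size=32):
    """Group-wise symmetric INT4 quantization: each group of
    `group_size` weights shares one floating-point scale, and values
    are rounded into the 4-bit signed range [-7, 7]."""
    w = w.reshape(-1, group_size)
    scales = np.abs(w).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0                    # guard all-zero groups
    q = np.clip(np.round(w / scales), -7, 7).astype(np.int8)
    return q, scales

def dequantize(q, scales):
    return q.astype(np.float32) * scales

w = np.random.randn(4096).astype(np.float32)
q, scales = int4_groupwise_quantize(w)
err = np.abs(dequantize(q, scales).ravel() - w).mean()
print(f"mean abs quantization error: {err:.4f}")
```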

GPT4AIGChip: Towards Next-Generation AI Accelerator Design Automation via Large Language Models link

This paper explores using LLMs to automate the design of AI accelerators, so as to reduce the dependence on domain-specific hardware expertise, and implements a framework that guides the model to generate accelerators (arguably the first work to propose generating AI accelerators with large models). Whether LLMs hold accurate prior knowledge about accelerator design still needs to be verified (using large models to generate hardware may well become a trend).

Extension: for in-context learning, see A Survey on In-context Learning link.