
Primary language: Python · License: Apache-2.0

ParaTra: A Parallel Transformer Inference Framework for GPUs in Edge Computing

Edge computing is widely used to deploy and serve deep learning applications. Equipped with GPUs, edge nodes can process concurrent incoming inference requests for a deep learning model. However, existing inference methods do not handle concurrent user requests efficiently in parallel. This paper investigates the popular Transformer deep learning model and develops ParaTra, a parallel Transformer inference framework that provides parallel inference services to users. In the framework, the Transformer model is partitioned and deployed across users' devices and the edge node to make efficient use of their processing power. Concurrent inference tasks of different sizes are dynamically packaged in a scheduling queue and sent in batches to an encoder-decoder pipeline for processing (see the sketch below). ParaTra significantly reduces the overhead of parallel processing and GPU memory usage. Experimental results show that ParaTra saves up to 37.1% of GPU memory and improves processing speed by up to 8.4×.
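The sketch below illustrates the dynamic-batching idea from the abstract: concurrent requests of different sizes wait in a scheduling queue, are packaged into a padded batch, and are then handed to the encoder-decoder pipeline. This is a minimal, self-contained illustration, not ParaTra's actual API; the names `InferenceRequest`, `BatchScheduler`, `max_batch_size`, and `max_wait_s` are all hypothetical, and the real framework additionally partitions the model between the user device and the edge GPU.

```python
# Illustrative sketch only: all class and parameter names are assumptions,
# not taken from the ParaTra codebase.
import queue
import time
from dataclasses import dataclass


@dataclass
class InferenceRequest:
    request_id: int
    tokens: list[int]  # token ids; in ParaTra, early layers may run on the user device


class BatchScheduler:
    """Collects concurrent requests and emits them in batches."""

    def __init__(self, max_batch_size: int = 8, max_wait_s: float = 0.05):
        self.pending: "queue.Queue[InferenceRequest]" = queue.Queue()
        self.max_batch_size = max_batch_size  # hypothetical tuning knobs
        self.max_wait_s = max_wait_s

    def submit(self, request: InferenceRequest) -> None:
        self.pending.put(request)

    def next_batch(self) -> list[InferenceRequest]:
        """Block until the batch is full or the wait budget expires."""
        batch: list[InferenceRequest] = []
        deadline = time.monotonic() + self.max_wait_s
        while len(batch) < self.max_batch_size:
            timeout = deadline - time.monotonic()
            if timeout <= 0:
                break
            try:
                batch.append(self.pending.get(timeout=timeout))
            except queue.Empty:
                break
        return batch


def pad_batch(batch: list[InferenceRequest], pad_id: int = 0) -> list[list[int]]:
    """Pad variable-length requests only up to this batch's longest sequence."""
    max_len = max(len(r.tokens) for r in batch)
    return [r.tokens + [pad_id] * (max_len - len(r.tokens)) for r in batch]


if __name__ == "__main__":
    scheduler = BatchScheduler()
    for i, n in enumerate([3, 5, 2]):
        scheduler.submit(InferenceRequest(request_id=i, tokens=list(range(n))))
    batch = scheduler.next_batch()
    print(pad_batch(batch))  # three rows, each padded to length 5
```

Batching requests this way amortizes per-request GPU overhead across the batch, and padding only to the longest sequence in each batch (rather than a global maximum) keeps wasted GPU memory low, which is consistent with the memory savings the paper reports.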