λDNN: Achieving Predictable Distributed DNN Training with Serverless Architectures

λDNN is a cost-efficient function resource provisioning framework that minimizes the monetary cost while guaranteeing the performance of distributed DNN (DDNN) training workloads on serverless platforms.

Overview of λDNN

The λDNN framework runs on AWS Lambda and comprises two modules: a training performance predictor and a function resource provisioner. To guarantee the objective DDNN training time, the resource provisioner identifies the cost-efficient serverless function resource provisioning plan. Once that plan is determined, the function allocator sets up the chosen number of functions, each with an appropriate amount of memory.
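The interplay between the predictor and the provisioner can be illustrated with a minimal sketch. It is not the λDNN implementation: `cheapest_plan`, `Plan`, and `toy_predictor` are hypothetical names, the toy predictor is a made-up stand-in for the real performance model, and the price per GB-second is an assumed constant that should be replaced with the current rate for your region. The sketch simply enumerates candidate (number of functions, memory size) pairs, discards those whose predicted training time exceeds the objective, and keeps the cheapest remaining plan.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

# Assumed price per GB-second of function execution; replace with the
# current rate for your region and architecture.
PRICE_PER_GB_SECOND = 0.0000166667


@dataclass(frozen=True)
class Plan:
    num_functions: int
    memory_mb: int
    predicted_time_s: float
    cost_usd: float


def cheapest_plan(
    predict_time: Callable[[int, int], float],
    function_counts: Iterable[int],
    memory_sizes_mb: Iterable[int],
    objective_time_s: float,
    price: float = PRICE_PER_GB_SECOND,
) -> Optional[Plan]:
    """Exhaustively search for the cheapest plan meeting the time objective.

    Returns None when no candidate plan satisfies the objective.
    """
    memory_sizes = list(memory_sizes_mb)  # reused in the inner loop
    best: Optional[Plan] = None
    for n in function_counts:
        for m in memory_sizes:
            t = predict_time(n, m)
            if t > objective_time_s:
                continue  # violates the objective training time
            # Cost model (assumption): every function is billed for the
            # whole training time at its configured memory size.
            cost = n * (m / 1024) * t * price
            if best is None or cost < best.cost_usd:
                best = Plan(n, m, t, cost)
    return best


def toy_predictor(n: int, memory_mb: int) -> float:
    """Hypothetical stand-in for the performance predictor (NOT the λDNN model)."""
    load = 5.0                                 # fixed loading time
    compute = 3600.0 / (n * memory_mb / 1024)  # shrinks with more resources
    comm = 2.0 * n                             # grows with more functions
    return load + compute + comm


if __name__ == "__main__":
    plan = cheapest_plan(toy_predictor, [1, 2, 4, 8], [1024, 2048, 3008], 600.0)
    print(plan)
```

An exhaustive search is used here only because the candidate space in this sketch is tiny; the actual provisioner may use a different strategy.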

Modeling DDNN Training Performance In Serverless Platforms

In general, a DNN model requires a number of iterations (denoted by k) to converge to an objective training loss value. Accordingly, the DDNN training time T can be calculated by summing the loading time, the computation time, and the communication time, which is given by

The loading time is calculated as

Given n provisioned functions, the computation time tcomp of model gradients is defined as

The data communication time is calculated as

The objective is to minimize the monetary cost of provisioned function resources, while guaranteeing the performance of DDNN training workloads. The optimization problem is formally defined as

Publication

Fei Xu, Yiling Qin, Li Chen, Zhi Zhou, Fangming Liu, “λDNN: Achieving Predictable Distributed DNN Training with Serverless Architectures,” to appear in IEEE Transactions on Computers, 2021, DOI:10.1109/TC.2021.3054656.
