Microsoft OpenPAI FrameworkController

FrameworkController (FC), a standalone component of Microsoft OpenPAI, is built to orchestrate all kinds of applications on Kubernetes with a single controller, especially DeepLearning applications.

These kinds of applications include, but are not limited to:

Stateless and Stateful Service:
- DeepLearning Serving: TensorFlow Serving, etc.
- Big Data Serving: HDFS, HBase, Kafka, Etcd, Nginx, etc.
Stateless and Stateful Batch:
- DeepLearning AllReduce Training: TensorFlow MultiWorkerMirrored Training, Horovod Training, etc.
- DeepLearning Elastic Training without Server: PyTorch Elastic Training with whole cluster shared etcd, etc.
- DeepLearning Batch/Offline Inference: PyTorch Inference, etc.
- Automated Machine Learning: NNI, etc.
- Big Data Batch Processing: Standalone Spark, KD-Tree Building, etc.
Any combination of the above applications:
- DeepLearning ParameterServer Training: TensorFlow ParameterServer Training, etc.
- DeepLearning Interactive Training: TensorFlow with Jupyter Notebook, etc.
- DeepLearning Elastic Training with Server: PyTorch Elastic Training with per-application dedicated etcd, etc.
- DeepLearning Streaming/Online Inference: TensorFlow Inference with Streaming I/O, etc.
- DeepLearning Incremental/Online Training: TensorFlow Training with Streaming I/O, etc.
- Big Data Stream Processing: Standalone Flink, etc.

Why Need It

Problem

In the open source community, there are many specialized Kubernetes Pod controllers, each built for a specific kind of application, such as the Kubernetes StatefulSet Controller, Kubernetes Job Controller, KubeFlow TensorFlow Operator, and KubeFlow PyTorch Operator. However, none of them is built for all kinds of applications, and even a combination of the existing ones still cannot support some kinds of applications. As a result, we have to learn, use, develop, deploy and maintain many Pod controllers.

Solution

Build a General-Purpose Kubernetes Pod Controller: FrameworkController.

It brings the following benefits:

Support applications that Kubernetes official controllers do not support:
- Stateful Batch with Service applications, like TensorFlow ParameterServer Training on FC.
- ScaleUp/ScaleDown Tolerable Stateful Batch applications, like PyTorch Elastic Training on FC.
Only a single controller needs to be learned, used, developed, deployed and maintained
All kinds of applications can leverage almost all provided features and guarantees
All kinds of applications can be used through the same interface with a unified experience
If a specialized controller is really required, it can be built on top of FC instead of from scratch:
- Kubernetes official controllers adopt a similar practice. For example, the Kubernetes Deployment Controller is built on top of the Kubernetes ReplicaSet Controller.

Architecture

Feature

Framework Feature

A Framework represents an application with a set of Tasks:

Executed by Kubernetes Pod
Partitioned into different heterogeneous TaskRoles which share the same lifecycle
Ordered in the same homogeneous TaskRole by TaskIndex
With consistent identity {FrameworkName}-{TaskRoleName}-{TaskIndex} as PodName
With fine-grained ExecutionType to Start/Stop the whole Framework
With fine-grained RetryPolicy for each Task and the whole Framework
With fine-grained FrameworkAttemptCompletionPolicy for each TaskRole
With PodGracefulDeletionTimeoutSec for each Task to tune Consistency vs Availability
With fine-grained Status for each TaskAttempt/Task, each TaskRole and the whole FrameworkAttempt/Framework
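
As a rough illustration of the identity scheme above, the following Python sketch enumerates the PodNames of a Framework. The TaskRole class, the pod_names helper, and the example names are hypothetical and not part of the FrameworkController API. The sketch also assumes that TaskIndex starts at 0 within each TaskRole.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskRole:
    """Hypothetical, simplified stand-in for a TaskRole: a name and a Task count."""
    name: str
    task_number: int


def pod_names(framework_name: str, task_roles: List[TaskRole]) -> List[str]:
    """Build {FrameworkName}-{TaskRoleName}-{TaskIndex} for every Task.

    Assumption: TaskIndex runs from 0 to task_number - 1 inside each TaskRole.
    """
    names = []
    for role in task_roles:
        for task_index in range(role.task_number):
            names.append(f"{framework_name}-{role.name}-{task_index}")
    return names


if __name__ == "__main__":
    # Example: a ParameterServer-style Framework with two heterogeneous TaskRoles.
    roles = [TaskRole("ps", 2), TaskRole("worker", 3)]
    for name in pod_names("tfjob", roles):
        print(name)
```

Because the name depends only on the Framework, the TaskRole and the TaskIndex, a Task keeps the same PodName identity regardless of which Node its Pod runs on.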

Controller Feature

Highly generalized as it is built for all kinds of applications
Light-weight as it is only responsible for Pod orchestration
Well-defined Framework Consistency vs Availability, State Machine and Failure Model
Tolerates unexpected Pod/ConfigMap deletion and Node/Network/FrameworkController/Kubernetes failure
Supports specifying how to classify and summarize Pod failures
Supports ScaleUp/ScaleDown of a Framework with Strong Safety Guarantee
Supports exposing Framework and Pod history snapshots to external systems
Easy to leverage FrameworkBarrier to achieve light-weight Gang Execution and Service Discovery
Easy to leverage HiveDScheduler to achieve GPU Topology-Aware, Multi-Tenant, Priority and Gang Scheduling
Compatible with other Kubernetes features, such as Kubernetes Service, GPU Scheduling, Volume, Logging
Idiomatic with Kubernetes official controllers, such as Pod Spec
Aligned with Kubernetes Controller Design Guidelines and API Conventions

Prerequisite

A Kubernetes cluster, v1.16.15 or above, on-cloud or on-premise.
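
A minimal sketch of checking a cluster version string against this prerequisite, assuming the version is reported in the usual vMAJOR.MINOR.PATCH form (for example, the gitVersion field printed by kubectl version). The helper names are hypothetical, and any suffix after the patch number is ignored.

```python
import re
from typing import Tuple

# Minimum Kubernetes version stated in the prerequisite above.
MIN_VERSION = (1, 16, 15)


def parse_version(git_version: str) -> Tuple[int, ...]:
    """Extract (major, minor, patch) from a string such as 'v1.18.6' or '1.20.4-gke.1'."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", git_version)
    if match is None:
        raise ValueError(f"unrecognized version string: {git_version!r}")
    return tuple(int(part) for part in match.groups())


def meets_prerequisite(git_version: str) -> bool:
    """Return True if the version is v1.16.15 or above (numeric tuple comparison)."""
    return parse_version(git_version) >= MIN_VERSION


if __name__ == "__main__":
    for version in ["v1.16.15", "v1.16.14", "v1.20.4-gke.1"]:
        print(version, meets_prerequisite(version))
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as treating v1.9.0 as newer than v1.16.15.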

Quick Start

Doc

Official Image

DockerHub

Related Project

Third Party Controller Wrapper

A specialized wrapper can be built on top of FrameworkController to optimize for a specific kind of application:

Microsoft OpenPAI Controller Wrapper (Job RestServer): A wrapper client optimized for AI applications
Microsoft AzureML Kubernetes Compute Controller Wrapper: A wrapper client optimized for AI applications. AzureML Kubernetes Compute, or ITP (Integrated Training Platform), is built for both first-party and third-party users and will eventually be leveraged by AML (Azure Machine Learning)
Microsoft DLWorkspace Controller Wrapper (Job Manager): A wrapper client optimized for AI applications
Microsoft NNI Controller Wrapper (TrainingService): A wrapper client optimized for AutoML applications

Recommended Kubernetes Scheduler

FrameworkController can directly leverage many Kubernetes Schedulers. Among them, we recommend the following as the best fits:

Kubernetes Default Scheduler: A General-Purpose Kubernetes Scheduler
HiveDScheduler: A Kubernetes Scheduler Extender optimized for AI applications

Similar Offering On Other Cluster Manager

YARN FrameworkLauncher: Similar offering on Apache YARN

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
