/pai

Resource scheduling and cluster management for AI

Primary language: JavaScript. License: MIT.

Open Platform for AI (OpenPAI)


As of December 2022, Microsoft, the initial developer, announced that they would no longer develop and maintain the OpenPAI platform. Nevertheless, OpenXPU will resume the development of new features, fix any errors if necessary and maintain the OpenPAI platform from this point onward.

                                                                                                                                                                                     
Architecture layers:

  • Marketplace, Web Portal, VS Code, SDK
  • API
  • Services: User Authentication, User/Group Management, Storage Management, Cluster/Job Monitoring, Job Orchestration, Job Scheduling, Job Runtime, Job Error Analysis
  • Kubernetes Cluster Management
  • CPU/GPU/FPGA/InfiniBand

If you are looking for custom support from the OpenXPU team, see the OpenXPU Support Program.


When to consider OpenPAI

  1. When your organization needs to share powerful AI computing resources (GPU/FPGA farm, etc.) among teams.
  2. When your organization needs to share and reuse common AI assets like Model, Data, Environment, etc.
  3. When your organization needs an easy IT ops platform for AI.
  4. When you want to run a complete training pipeline in one place.
  5. When you want to run training & inference tasks in one place.

Why choose OpenPAI

The platform incorporates a mature design with a proven track record in Microsoft's large-scale production environment, and OpenXPU extends OpenPAI with resource (GPU) virtualization capabilities.

Most complete solution and easy to extend

OpenPAI is a complete solution for deep learning: it supports virtual clusters, is compatible with the Kubernetes ecosystem, and runs a complete training pipeline on one cluster. OpenPAI is architected in a modular way, so different modules can be plugged in as appropriate. The OpenPAI architecture diagram highlights the technical innovations of the platform.

Support on-premises and easy to deploy

OpenPAI is a full-stack solution. It supports on-premises, hybrid, and public cloud deployment, as well as single-box deployment for trial users.

Support popular AI frameworks and heterogeneous hardware

OpenPAI provides pre-built containers for popular AI frameworks, makes it easy to include heterogeneous hardware, and supports distributed training, such as distributed TensorFlow.

Virtualization improves efficiency and flexibility

XPU is a container-based GPU virtualization product that splits GPUs dynamically and fully on demand at the OS kernel layer. OpenPAI works smoothly with XPU technology, which improves the efficiency of computing resources and provides more flexible scheduling and higher task throughput.
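
As a rough illustration of why splitting GPUs can raise task throughput, the toy calculation below compares how many jobs fit on a fixed pool when each job must reserve whole GPUs versus when a job can take a fraction of one. This is not XPU's mechanism or API: the job demands, the pool size, and the first-fit packing are all assumptions made up for the example.

```python
import math


def jobs_placed_whole(demands, num_gpus):
    """Count jobs placed when each job reserves whole GPUs (demand rounded up)."""
    free = num_gpus
    placed = 0
    for demand in demands:
        need = math.ceil(demand)
        if need <= free:
            free -= need
            placed += 1
    return placed


def jobs_placed_fractional(demands, num_gpus):
    """Count jobs placed with first-fit packing of fractional demands.

    Each demand is assumed to be at most one GPU, so a job never spans devices.
    """
    free = [1.0] * num_gpus
    placed = 0
    for demand in demands:
        for i, capacity in enumerate(free):
            if demand <= capacity:
                free[i] = capacity - demand
                placed += 1
                break
    return placed


# Hypothetical per-job GPU demands, in fractions of one device.
demands = [0.5, 0.25, 0.25, 0.5, 0.5, 1.0]

print(jobs_placed_whole(demands, 2))       # 2
print(jobs_placed_fractional(demands, 2))  # 5
```

With these made-up numbers, two physical GPUs host two jobs under whole-GPU allocation and five under fractional allocation. Real gains depend on the actual workload mix.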

Installation and user guide

OpenPAI manages computing resources and is optimized for deep learning. Through container technology, the computing hardware is decoupled from the software, so it is easy to run distributed jobs, switch between deep learning frameworks, or run other kinds of jobs in consistent environments.

As OpenPAI is a platform, there are typically two different roles:

  • Cluster users are the consumers of the cluster's computing resources. Depending on the deployment scenario, cluster users could be machine learning and deep learning researchers, data scientists, lab teachers, students, and so on.
  • Cluster administrators are the maintainers of computing resources. The administrators are responsible for the deployment and availability of the cluster.

OpenPAI provides end-to-end manuals for both cluster users and administrators.

For cluster administrators

The admin manual is a comprehensive guide for cluster administrators.

For cluster users

The user manual is a guide for cluster users, who can train and serve deep learning (and other) tasks on OpenPAI.

  • Job submission and monitoring. The quick start tutorial is a good starting point for learning how to train models on OpenPAI. More examples, and support for multiple mainstream frameworks through out-of-the-box Docker images, are also available. OpenPAI also supports job debugging and advanced job functionality.

  • Data management. Users can use cluster-provisioned storage and custom storage in their jobs. Cluster-provisioned storage is well integrated and easy to configure in a job.

  • Collaboration and sharing. OpenPAI provides facilities for collaboration in teams and organizations. Cluster-provisioned storage is organized by teams (groups). Users can easily share their work (e.g. jobs) in the marketplace, where others can discover and reproduce (clone) it with one click.
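
To make the job submission step above more concrete, the sketch below assembles a minimal job description as a Python dictionary and prints it as JSON. The field names (protocolVersion, prerequisites, taskRoles, resourcePerInstance, and so on) follow our reading of the openpai-protocol job specification and should be checked against that repo before use. The job name, Docker image, command, and the helper functions build_job and check_job are hypothetical placeholders. OpenPAI job files are typically written in YAML; JSON is used here only to keep the sketch free of third-party dependencies.

```python
import json


def build_job(name, image_uri, command, gpus=1, cpus=4, memory_mb=8192):
    """Return a minimal single-task-role job description as a dict.

    Field names are an assumption based on the openpai-protocol spec;
    verify them against the spec before submitting to a real cluster.
    """
    return {
        "protocolVersion": 2,
        "name": name,
        "type": "job",
        "prerequisites": [
            # A named Docker image that task roles refer to by name.
            {"type": "dockerimage", "name": "image", "uri": image_uri},
        ],
        "taskRoles": {
            "worker": {
                "instances": 1,
                "dockerImage": "image",
                "resourcePerInstance": {
                    "gpu": gpus,
                    "cpu": cpus,
                    "memoryMB": memory_mb,
                },
                "commands": [command],
            },
        },
    }


def check_job(job):
    """Return the task roles whose dockerImage is not declared in prerequisites."""
    declared = {
        p["name"] for p in job["prerequisites"] if p["type"] == "dockerimage"
    }
    return [
        role
        for role, spec in job["taskRoles"].items()
        if spec["dockerImage"] not in declared
    ]


job = build_job(
    name="hello-openpai",                   # hypothetical job name
    image_uri="example/tensorflow:latest",  # hypothetical image
    command="python train.py",              # hypothetical command
)
assert check_job(job) == []  # every task role references a declared image
print(json.dumps(job, indent=2))
```

Keeping the image under prerequisites and referring to it by name from each task role lets several roles share one image definition. The small check catches a dangling reference before the job reaches the cluster.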

Besides the web portal, OpenPAI provides a VS Code extension and a command line tool (preview). The VS Code extension is a friendly, GUI-based client for OpenPAI, and it is highly recommended. As an extension of Visual Studio Code, it can submit jobs, simulate jobs locally, manage multiple OpenPAI environments, and so on.

Standalone Components

OpenPAI uses a modular component design and organizes its code into one main repo and 7 standalone key component repos. pai is the main repo, and the 7 component repos are:

  • hivedscheduler is a Kubernetes Scheduler Extender for multi-tenant GPU clusters, which provides various advantages over the standard Kubernetes scheduler.
  • frameworkcontroller is built to orchestrate all kinds of applications on Kubernetes with a single controller.
  • openpai-protocol is the specification of the OpenPAI job protocol.
  • openpai-runtime provides the runtime support that the OpenPAI protocol requires.
  • openpaisdk is a JavaScript SDK designed to help OpenPAI developers offer a more user-friendly experience.
  • openpaimarketplace is a service that stores examples and job templates. Users can access it through a web portal plugin to share their jobs, or to run and learn from jobs shared by others.
  • openpaivscode is a VS Code extension that lets users easily connect to OpenPAI clusters, submit AI jobs, simulate jobs locally, and manage files in VS Code.

OpenPAI Manual

  • Detailed documentation can be found in the OpenPAI Manual.

Related Projects

Aiming at openness and at advancing state-of-the-art technology, Microsoft Research (MSR) and Microsoft Software Technology Center Asia (STCA) have also released a few other open source projects. We encourage researchers and students to leverage these projects to accelerate AI development and research.

  • NNI : An open source AutoML toolkit for neural architecture search and hyper-parameter tuning.
  • MMdnn : A comprehensive, cross-framework solution to convert, visualize and diagnose deep neural network models. The "MM" in MMdnn stands for model management and "dnn" is an acronym for deep neural network.
  • NeuronBlocks : An NLP deep learning modeling toolkit that helps engineers build DNN models like playing with Lego. The main goal of this toolkit is to minimize the development cost of building NLP deep neural network models, including both the training and inference stages.
  • SPTAG : Space Partition Tree And Graph (SPTAG) is an open source library for large-scale approximate nearest neighbor search over vectors.

Get involved

How to contribute

Contributor License Agreement

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using the CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Contributors

One key purpose of OpenPAI is to support the highly diversified requirements of academia and industry. OpenPAI is completely open: it is under the MIT license. This makes OpenPAI particularly attractive for evaluating various research ideas, which include, but are not limited to, its components.

OpenPAI operates in an open model. It was initially designed and developed by the Microsoft Research (MSR) and Microsoft Software Technology Center Asia (STCA) platform team. Peking University, Xi'an Jiaotong University, Zhejiang University, University of Science and Technology of China and SHANGHAI INESA AI INNOVATION CENTER (SHAIIC) have also jointly developed the platform.

After v1.8.1, Microsoft announced that it would no longer develop and maintain the OpenPAI platform. Since then, OpenXPU has resumed the development of new features, fixes errors as necessary, and maintains the OpenPAI platform.

Contributions from academia and industry are all highly welcome.