This is an online AI-System course that helps students learn the full stack of systems that support AI and practice them in real projects. In this course, the terms AI-System and System for AI are used interchangeably.
In recent years, the rapid development of artificial intelligence, especially deep learning, has been inseparable from continuous progress in hardware and software systems. In the foreseeable future, artificial intelligence will continue to rely on a joint innovation model that combines computer systems and artificial intelligence. Computer systems now empower artificial intelligence at larger scale and higher complexity, which requires not only more system innovation but also systematic thinking and methodology. At the same time, artificial intelligence in turn supports the design of complex systems.
We have noticed that most current artificial intelligence courses, especially those on deep learning and machine learning, focus mainly on theories, algorithms, or applications, while courses on the underlying systems are rare. We hope that a course on artificial intelligence systems can make AI education more comprehensive and in-depth, and thereby help cultivate talent at the intersection of artificial intelligence and systems.
This course aims to help students:
- Completely understand the computer system architecture that supports deep learning, and learn system design across the full life cycle of deep learning through practical problems.
- Get introduced to cutting-edge systems and artificial intelligence research, including AI for Systems and Systems for AI, so that senior undergraduates and graduate students can better find and define meaningful research questions.
- Practice with lab courses designed from the perspective of systems research: implement and optimize system modules by operating and applying mainstream and latest frameworks, platforms, and tools, so as to improve the ability to solve practical problems rather than merely learn how to use tools.
Prerequisites: C/C++/Python, Computer Architecture, Introduction to algorithms
The course mainly includes the following three modules:
- The first part covers the basics of artificial intelligence, a full-stack overview of artificial intelligence systems, and the systematic design and methodology of deep learning systems.
- The second part consists of advanced courses that cover the most cutting-edge research areas at the intersection of systems and artificial intelligence.
- The third part consists of the supporting lab courses, which cover the most mainstream frameworks, platforms, and tools, together with a series of lab projects.
The first part focuses on basic knowledge, while the content of the other two parts will be adjusted dynamically as academia and industry progress. The latter two parts are organized in modular form so that they can be adjusted or combined with other CS courses (such as compiler principles) as advanced lectures or internship projects.
The design of this course also draws on the research results and experience of Microsoft Research Asia at the intersection of artificial intelligence and systems, including some platforms and tools developed by Microsoft and the research institute. Other schools and teachers are encouraged to add or adjust more advanced topics or other experiments according to their needs.
Lectures have two parts: basic courses and advanced courses. The first part focuses on basic theories (Lectures 1 to 8), while the second part involves more cutting-edge research (Lectures 9 to 14).
Basic Courses
| Course No. | Lecture Name | Remarks |
| --- | --- | --- |
| 1 | Introduction | Overview and system/AI basics |
| 2 | System perspective of Systems for AI | Systems for AI: a historic view; fundamentals of neural networks; fundamentals of Systems for AI |
| 3 | Computation frameworks for DNN | Backprop and AD, tensors, DAGs, execution graphs (see the sketch below this table). Papers and systems: PyTorch, TensorFlow |
| 4 | Computer architecture for matrix computation | Matrix computation, CPU/SIMD, GPGPU, ASIC/TPU. Papers and systems: BLAS, TPU |
| 5 | Distributed training algorithms | Data parallelism, model parallelism, distributed SGD. Papers and systems: PipeDream |
| 6 | Distributed training systems | MPI, parameter servers, all-reduce, RDMA. Papers and systems: Horovod |
| 7 | Scheduling and resource management system | Running DNN jobs on a cluster: containers, resource allocation, scheduling. Papers and systems: Kubeflow, OpenPAI, Gandiva, HiveD |
| 8 | Inference systems | Efficiency, latency, throughput, and deployment. Papers and systems: TensorRT, TensorFlow Lite, ONNX |
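To make the topics of Lecture 3 concrete, the following minimal sketch (added here for illustration only; it is not course-provided code) shows how a framework such as PyTorch builds an execution graph from tensor operations and runs reverse-mode automatic differentiation over it:

```python
# Minimal illustration of Lecture 3's topics: tensors, the execution graph
# (a DAG recorded during the forward pass), and reverse-mode AD (backprop).
import torch

# Leaf tensors with requires_grad=True become nodes in the graph.
w = torch.randn(3, requires_grad=True)
x = torch.tensor([1.0, 2.0, 3.0])

# Each operation records a grad_fn, implicitly building the DAG.
y = (w * x).sum()   # forward pass
y.backward()        # reverse-mode AD traverses the DAG backwards

print(w.grad)       # dy/dw equals x: tensor([1., 2., 3.])
```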
Advanced Courses
| Course No. | Course Name | Remarks |
| --- | --- | --- |
| 9 | Computation graph compilation and optimization | IR, sub-graph pattern matching, matrix multiplication and memory optimization. Papers and systems: XLA, MLIR, TVM, NNFusion |
| 10 | Efficiency via compression and sparsity | Model compression, sparsity, pruning (see the sketch below this table) |
| 11 | AutoML systems | Hyperparameter tuning, NAS. Papers and systems: Hyperband, SMAC, ENAS, AutoKeras, NNI |
| 12 | Reinforcement learning systems | Theory of RL, systems for RL. Papers and systems: A3C, RLlib, AlphaZero |
| 13 | Security and privacy | Federated learning, security, privacy. Papers and systems: DeepFake |
| 14 | AI for systems | AI for traditional systems problems and system algorithms. Papers and systems: learned indexes, learned query paths |
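As a small illustration of Lecture 10 (compression, sparsity, pruning), the sketch below applies one-shot magnitude pruning to a single PyTorch layer. The layer shape and sparsity level are illustrative assumptions, not values prescribed by the lecture:

```python
# Toy magnitude-based pruning: zero the weights with the smallest magnitude.
import torch
import torch.nn as nn

layer = nn.Linear(128, 64)
sparsity = 0.8  # fraction of weights to remove (illustrative choice)

with torch.no_grad():
    w = layer.weight
    k = int(sparsity * w.numel())
    threshold = w.abs().flatten().kthvalue(k).values  # k-th smallest magnitude
    mask = (w.abs() > threshold).float()              # keep only larger weights
    w.mul_(mask)                                      # apply the sparsity mask

zero_frac = (layer.weight == 0).float().mean().item()
print(f"fraction of zero weights: {zero_frac:.2f}")
```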
Labs also have two parts. The basic part is designed so that students can run most labs on a local machine; the advanced part may need a small cluster (local or in the cloud) with GPU support.
Basic Labs
| Lab No. | Lab Name | Remarks |
| --- | --- | --- |
| Lab 1 | A simple end-to-end AI example, from a system perspective | Understand the system from debug info and system logs |
| Lab 2 | Customize operators | Design and implement a customized operator (both forward and backward) in Python (see the sketch below this table) |
| Lab 3 | CUDA implementation | Add a CUDA implementation for the customized operator |
| Lab 4 | AllReduce implementation | Improve AllReduce on Horovod: implement a lossy compression (3LC) on GPU for low-bandwidth networks |
| Lab 5 | Configure containers for customized training and inference | Configure containers |
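As a starting point for Lab 2, here is a minimal sketch of a customized operator with both forward and backward passes, written with PyTorch's `torch.autograd.Function`. The specific operator (a scaled ReLU) is only an illustrative assumption, not the operator the lab requires:

```python
# A customized operator (forward and backward) as a torch.autograd.Function.
import torch

class ScaledReLU(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, scale):
        ctx.save_for_backward(x)
        ctx.scale = scale
        return scale * x.clamp(min=0)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        grad_x = grad_output * ctx.scale * (x > 0).float()
        return grad_x, None  # `scale` is a Python number, so no gradient for it

x = torch.randn(4, requires_grad=True)
y = ScaledReLU.apply(x, 2.0).sum()
y.backward()
print(x.grad)  # 2.0 where x > 0, else 0.0
```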
Advanced Labs
| Lab No. | Lab Name | Remarks |
| --- | --- | --- |
| Lab 6 | Scheduling and resource management system | Get familiar with OpenPAI or KubeFlow |
| Lab 7 | Distributed training | Try different kinds of all-reduce implementations (see the sketch below this table) |
| Lab 8 | AutoML | Search for a new neural network structure for image/NLP tasks |
| Lab 9 | RL systems | Configure and get familiar with one of the following RL systems: RLlib, … |
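For Lab 7, the following toy, single-process simulation sketches the ring all-reduce communication pattern. Real implementations (e.g., Horovod, NCCL, MPI) run across processes and overlap communication with computation, so treat this only as an illustration of the data movement:

```python
# Single-process simulation of ring all-reduce over n "workers".
import numpy as np

def ring_allreduce(vectors):
    """Element-wise sum across equal-length vectors, one per simulated worker."""
    n = len(vectors)
    # Each worker splits its vector into n segments.
    parts = [np.array_split(v.astype(float), n) for v in vectors]

    # Reduce-scatter: after n-1 steps, worker i holds the full sum of
    # segment (i + 1) % n.
    for step in range(n - 1):
        for i in range(n):
            seg = (i - step) % n
            parts[(i + 1) % n][seg] += parts[i][seg]

    # All-gather: circulate the completed segments around the ring.
    for step in range(n - 1):
        for i in range(n):
            seg = (i + 1 - step) % n
            parts[(i + 1) % n][seg] = parts[i][seg].copy()

    return [np.concatenate(p) for p in parts]

workers = [np.arange(8) * (rank + 1) for rank in range(4)]
print(ring_allreduce(workers)[0])  # every worker ends with the element-wise sum
```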
The following lists related courses on artificial intelligence systems at other schools and institutions.
<TBD>
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
Microsoft and any contributors grant you a license to the Microsoft documentation and other content in this repository under the Creative Commons Attribution 4.0 International Public License, see the LICENSE file, and grant you a license to any code in the repository under the MIT License, see the LICENSE-CODE file.
Microsoft, Windows, Microsoft Azure and/or other Microsoft products and services referenced in the documentation may be either trademarks or registered trademarks of Microsoft in the United States and/or other countries. The licenses for this project do not grant you rights to use any Microsoft names, logos, or trademarks. Microsoft's general trademark guidelines can be found at http://go.microsoft.com/fwlink/?LinkID=254653.
Privacy information can be found at https://privacy.microsoft.com/en-us/
Microsoft and any contributors reserve all other rights, whether under their respective copyrights, patents, or trademarks, whether by implication, estoppel or otherwise.