/awesome-mlsys

A curated list of research in machine learning systems. I also summarize papers that I find especially interesting.

MIT License


Awesome System for Machine Learning

Path to system for AI [Whitepaper You Must Read]

A curated list of research in machine learning systems. Links to code are included where available. The project is now maintained by a team, and pull requests that follow our template are very welcome.

AI system

General Resources

System for AI Papers (Ordered by Category)

Paper Ordered by Conference

Survey

  • Toward Highly Available, Intelligent Cloud and ML Systems [Slide]
  • A curated list of awesome System Designing articles, videos and resources for distributed computing, AKA Big Data. [GitHub]
  • awesome-production-machine-learning: A curated list of awesome open source libraries to deploy, monitor, version and scale your machine learning [GitHub]
  • Opportunities and Challenges Of Machine Learning Accelerators In Production [Paper]
    • Ananthanarayanan, Rajagopal, et al. 2019 USENIX Conference on Operational Machine Learning (OpML 19).
  • How (and How Not) to Write a Good Systems Paper [Advice]
  • Applied machine learning at Facebook: a datacenter infrastructure perspective [Paper]
    • Hazelwood, Kim, et al. (HPCA 2018)
  • Infrastructure for Usable Machine Learning: The Stanford DAWN Project
    • Bailis, Peter, Kunle Olukotun, Christopher Ré, and Matei Zaharia. (preprint 2017)
  • Hidden technical debt in machine learning systems [Paper]
    • Sculley, David, et al. (NIPS 2015)
  • End-to-end arguments in system design [Paper]
    • Saltzer, Jerome H., David P. Reed, and David D. Clark.
  • System Design for Large Scale Machine Learning [Thesis]
  • Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications [Paper]
    • Park, Jongsoo, Maxim Naumov, Protonu Basu et al. arXiv 2018
    • Summary: This paper characterizes DL models and then presents new design principles for DL hardware.
  • A Berkeley View of Systems Challenges for AI [Paper]

Book

  • Computer Architecture: A Quantitative Approach [Must read]
  • Streaming Systems [Book]
  • Kubernetes in Action (start to read) [Book]
  • Machine Learning Systems: Designs that scale [Website]

Video

  • ScaledML 2020: Learn from the best minds in the machine learning community. [Video]
  • Jeff Dean: "Achieving Rapid Response Times in Large Online Services" Keynote - Velocity 2014 [YouTube]
  • From Research to Production with PyTorch [Video]
  • Introduction to Microservices, Docker, and Kubernetes [YouTube]
  • ICML Keynote: Lessons Learned from Helping 200,000 non-ML experts use ML [Video]
  • Adaptive & Multitask Learning Systems [Website]
  • System thinking. A TED talk. [YouTube]
  • Flexible systems are the next frontier of machine learning. Jeff Dean [YouTube]
  • Is It Time to Rewrite the Operating System in Rust? [YouTube]
  • InfoQ: AI, ML and Data Engineering [YouTube]
    • Start to watch.
  • Netflix: Human-centric Machine Learning Infrastructure [InfoQ]
  • SysML 2019 [YouTube]
  • ScaledML 2019: David Patterson, Ion Stoica, Dawn Song and so on [YouTube]
  • ScaledML 2018: Jeff Dean, Ion Stoica, Yangqing Jia and so on [YouTube] [Slides]
  • A New Golden Age for Computer Architecture: History, Challenges, and Opportunities. David Patterson [YouTube]
  • How to Have a Bad Career. David Patterson (I am a big fan) [YouTube]
  • SysML 18: Perspectives and Challenges. Michael Jordan [YouTube]
  • SysML 18: Systems and Machine Learning Symbiosis. Jeff Dean [YouTube]

Course

Blog

  • Parallelizing across multiple CPU/GPUs to speed up deep learning inference at the edge [Amazon Blog]
  • Building Robust Production-Ready Deep Learning Vision Models in Minutes [Blog]
  • Deploy Machine Learning Models with Keras, FastAPI, Redis and Docker [Blog]
  • How to Deploy a Machine Learning Model -- Creating a production-ready API using FastAPI + Uvicorn [Blog] [GitHub]
  • Deploying a Machine Learning Model as a REST API [Blog]
  • Continuous Delivery for Machine Learning [Blog]
  • Kubernetes CheatSheets In A4 [GitHub]
  • A Gentle Introduction to Kubernetes [Blog]
  • Train and Deploy Machine Learning Model With Web Interface - Docker, PyTorch & Flask [GitHub]
  • Learning Kubernetes, The Chinese Taoist Way [GitHub]
  • Data pipelines, Luigi, Airflow: everything you need to know [Blog]
  • The Deep Learning Toolset — An Overview [Blog]
  • Summary of CSE 599W: Systems for ML [Chinese Blog]
  • Polyaxon, Argo and Seldon for Model Training, Package and Deployment in Kubernetes [Blog]
  • Overview of the different approaches to putting Machine Learning (ML) models in production [Blog]
  • Being a Data Scientist does not make you a Software Engineer [Part1] Architecting a Machine Learning Pipeline [Part2]
  • Model Serving in PyTorch [Blog]
  • Machine learning in Netflix [Medium]
  • SciPy Conference Materials (slides, repo) [GitHub]
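  • After Spark, UC Berkeley releases Ray, a next-generation AI compute engine [Chinese Blog]
  • What background knowledge do you need to understand or do research on machine learning / deep learning systems? [Zhihu]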
  • Learn Kubernetes in Under 3 Hours: A Detailed Guide to Orchestrating Containers [Blog] [GitHub]
  • data-engineer-roadmap: Learning from multiple companies in Silicon Valley. Netflix, Facebook, Google, Startups [GitHub]
  • TensorFlow Serving + Docker + Tornado: fast, production-grade deployment of machine learning models [Chinese Blog]

Maintainer

  • Huaizheng Zhang, Nanyang Technological University.
  • Yizheng Huang, Nanyang Technological University. Focuses on the Inference System section.
  • Meng Shen, Nanyang Technological University. Focuses on the Edge AI System section.
  • Weiming Zhuang, Nanyang Technological University & SenseTime. Focuses on the Federated Learning System section.