/awesome-production-machine-learning

A curated list of awesome open source libraries to deploy, monitor, version and scale your machine learning

MIT License


Awesome Production Machine Learning

This repository contains a curated list of awesome open source libraries that will help you deploy, monitor, version, scale and secure your production machine learning 🚀

Quick links to sections in this page

๐Ÿ” Explaining Predictions & Models ๐Ÿ” Privacy Preserving ML ๐Ÿ“œ Model & Data Versioning
๐Ÿ Model Training Orchestration ๐Ÿ’ช Model Serving & Monitoring ๐Ÿค– Neural Architecture Search
๐Ÿ““ Data Science Notebook ๐Ÿ“Š Industry-strength Visualisation ๐Ÿ”  Industry-strength NLP
๐Ÿงต Data pipeline ๐Ÿท๏ธ Data Labelling ๐Ÿ“… Metadata Management
๐Ÿ“ก Functions as a Service ๐Ÿ—บ๏ธ Computation Distribution ๐Ÿ“ฅ Model Serialisation
๐Ÿงฎ Optimized Computation ๐Ÿ’ธ Data Stream Processing ๐Ÿ”ด Outlier & Anomaly Detection
๐ŸŒ€ Feature Engineering ๐ŸŽ Feature Store โš” Adversarial Robustness
๐Ÿ’พ Data Storage Optimization ๐Ÿ’ฐ Commercial Platform

10 Min Video Overview

This 10-minute video provides an overview of the motivations for machine learning operations, as well as a high-level overview of some of the tools in this repo.

Want to receive recurrent updates on this repo and other advancements?

You can join the Machine Learning Engineer newsletter. Join over 10,000 ML professionals and enthusiasts who receive weekly curated articles & tutorials on production Machine Learning.
Also check out the Awesome Artificial Intelligence Guidelines List, where we aim to map the landscape of "Frameworks", "Codes of Ethics", "Guidelines", "Regulations", etc. related to Artificial Intelligence.

Main Content

Explaining Black Box Models and Datasets

  • Aequitas - An open-source bias audit toolkit for data scientists, machine learning researchers, and policymakers to audit machine learning models for discrimination and bias, and to make informed and equitable decisions around developing and deploying predictive risk-assessment tools.
  • Alibi - Alibi is an open source Python library aimed at machine learning model inspection and interpretation. The initial focus of the library is on black-box, instance-based model explanations.
  • anchor - Code for the paper "High precision model agnostic explanations", a model-agnostic system that explains the behaviour of complex models with high-precision rules called anchors.
  • captum - model interpretability and understanding library for PyTorch developed by Facebook. It contains general purpose implementations of integrated gradients, saliency maps, smoothgrad, vargrad and others for PyTorch models.
  • casme - Example of using classifier-agnostic saliency map extraction on ImageNet, presented in the paper "Classifier-agnostic saliency map extraction".
  • ContrastiveExplanation (Foil Trees) - Python script for model agnostic contrastive/counterfactual explanations for machine learning. Accompanying code for the paper "Contrastive Explanations with Local Foil Trees".
  • DeepLIFT - Codebase that contains the methods in the paper "Learning important features through propagating activation differences". Here are the slides and the video of the 15 minute talk given at ICML.
  • DeepVis Toolbox - This is the code required to run the Deep Visualization Toolbox, as well as to generate the neuron-by-neuron visualizations using regularized optimization. The toolbox and methods are described casually here and more formally in this paper.
  • ELI5 - "Explain Like I'm 5" is a Python package which helps to debug machine learning classifiers and explain their predictions.
  • FACETS - Facets contains two robust visualizations to aid in understanding and analyzing machine learning datasets. Get a sense of the shape of each feature of your dataset using Facets Overview, or explore individual observations using Facets Dive.
  • Fairness Indicators - The tool supports teams in evaluating, improving, and comparing models for fairness concerns in partnership with the broader TensorFlow toolkit.
  • Fairlearn - Fairlearn is a Python toolkit to assess and mitigate unfairness in machine learning models.
  • FairML - FairML is a Python toolbox for auditing machine learning models for bias.
  • fairness - This repository is meant to facilitate the benchmarking of fairness aware machine learning algorithms based on this paper.
  • GEBI - Global Explanations for Bias Identification - Attention-based, summarized post-hoc explanations for detecting and identifying bias in data. The project proposes a global explanation and introduces a step-by-step framework for detecting and testing bias. Python package for image data.
  • AI Explainability 360 - Interpretability and explainability of data and machine learning models including a comprehensive set of algorithms that cover different dimensions of explanations along with proxy explainability metrics.
  • AI Fairness 360 - A comprehensive set of fairness metrics for datasets and machine learning models, explanations for these metrics, and algorithms to mitigate bias in datasets and models.
  • iNNvestigate - An open-source library for analyzing Keras models visually by methods such as DeepTaylor-Decomposition, PatternNet, Saliency Maps, and Integrated Gradients.
  • Integrated-Gradients - This repository provides code for implementing integrated gradients for networks with image inputs.
  • InterpretML - InterpretML is an open-source package for training interpretable models and explaining blackbox systems.
  • keras-vis - keras-vis is a high-level toolkit for visualizing and debugging your trained keras neural net models. Currently supported visualizations include: Activation maximization, Saliency maps, Class activation maps.
  • L2X - Code for replicating the experiments in the paper "Learning to Explain: An Information-Theoretic Perspective on Model Interpretation" at ICML 2018.
  • Lightly - A python framework for self-supervised learning on images. The learned representations can be used to analyze the distribution in unlabeled data and rebalance datasets.
  • Lightwood - A Pytorch based framework that breaks down machine learning problems into smaller blocks that can be glued together seamlessly with an objective to build predictive models with one line of code.
  • LIME - Local Interpretable Model-agnostic Explanations for machine learning models.
  • LOFO Importance - LOFO (Leave One Feature Out) Importance calculates the importances of a set of features based on a metric of choice, for a model of choice, by iteratively removing each feature from the set, and evaluating the performance of the model, with a validation scheme of choice, based on the chosen metric.
  • MindsDB - MindsDB is an Explainable AutoML framework for developers. With MindsDB you can build, train and use state of the art ML models in as little as one line of code.
  • mljar-supervised - An Automated Machine Learning (AutoML) python package for tabular data. It can handle: Binary Classification, MultiClass Classification and Regression. It provides feature engineering, explanations and markdown reports.
  • NETRON - Viewer for neural network, deep learning and machine learning models.
  • pyBreakDown - A model agnostic tool for decomposition of predictions from black boxes. Break Down Table shows contributions of every variable to a final prediction.
  • responsibly - Toolkit for auditing and mitigating bias and fairness of machine learning systems
  • SHAP - SHapley Additive exPlanations is a unified approach to explain the output of any machine learning model.
  • SHAPash - Shapash is a Python library that provides several types of visualization that display explicit labels that everyone can understand.
  • Skater - Skater is a unified framework to enable Model Interpretation for all forms of model to help one build an Interpretable machine learning system often needed for real world use-cases.
  • WhatIf - An easy-to-use interface for expanding understanding of a black-box classification or regression ML model.
  • Tensorflow's cleverhans - An adversarial example library for constructing attacks, building defenses, and benchmarking both. A python library to benchmark system's vulnerability to adversarial examples.
  • tensorflow's lucid - Lucid is a collection of infrastructure and tools for research in neural network interpretability.
  • tensorflow's Model Analysis - TensorFlow Model Analysis (TFMA) is a library for evaluating TensorFlow models. It allows users to evaluate their models on large amounts of data in a distributed manner, using the same metrics defined in their trainer.
  • themis-ml - themis-ml is a Python library built on top of pandas and sklearn that implements fairness-aware machine learning algorithms.
  • Themis - Themis is a testing-based approach for measuring discrimination in a software system.
  • TreeInterpreter - Package for interpreting scikit-learn's decision tree and random forest predictions. Allows decomposing each prediction into bias and feature contribution components as described here.
  • woe - Tools for WoE Transformation mostly used in ScoreCard Model for credit rating
  • XAI - eXplainableAI - An eXplainability toolbox for machine learning.
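
The leave-one-feature-out idea described in the LOFO Importance entry above can be sketched in a few lines. This is a minimal illustration, not the LOFO library's API: the nearest-centroid model, the toy data, the single fixed validation split, and all function names are assumptions chosen to keep the example dependency-free.

```python
from statistics import mean

def fit_centroids(rows, labels):
    """Fit a nearest-centroid classifier: one mean vector per class."""
    centroids = {}
    for cls in set(labels):
        members = [r for r, y in zip(rows, labels) if y == cls]
        centroids[cls] = [mean(col) for col in zip(*members)]
    return centroids

def predict(centroids, row):
    """Return the class whose centroid is closest (squared Euclidean distance)."""
    def dist(cls):
        return sum((a - b) ** 2 for a, b in zip(row, centroids[cls]))
    return min(sorted(centroids), key=dist)

def accuracy(train, y_train, valid, y_valid):
    """Metric of choice for this sketch: accuracy on a held-out split."""
    centroids = fit_centroids(train, y_train)
    hits = sum(predict(centroids, r) == y for r, y in zip(valid, y_valid))
    return hits / len(valid)

def drop_column(rows, idx):
    """Copy the rows without the feature at position idx."""
    return [[v for j, v in enumerate(r) if j != idx] for r in rows]

def lofo_importance(train, y_train, valid, y_valid, names):
    """Importance of a feature = score with all features - score without it."""
    base = accuracy(train, y_train, valid, y_valid)
    scores = {}
    for i, name in enumerate(names):
        reduced = accuracy(drop_column(train, i), y_train,
                           drop_column(valid, i), y_valid)
        scores[name] = base - reduced
    return scores

# Toy data (hypothetical): "signal" separates the classes, "noise" does not.
train = [[0.1, 0.5], [0.2, 0.1], [0.3, 0.9], [0.8, 0.4], [0.9, 0.2], [0.7, 0.8]]
y_train = [0, 0, 0, 1, 1, 1]
valid = [[0.15, 0.3], [0.25, 0.7], [0.85, 0.6], [0.75, 0.1]]
y_valid = [0, 0, 1, 1]

scores = lofo_importance(train, y_train, valid, y_valid, ["signal", "noise"])
print(scores)  # {'signal': 0.5, 'noise': 0.0}
```

Removing "signal" halves the validation accuracy, while removing "noise" changes nothing, which is the ranking one would hope for. A real analysis would typically replace the single split with cross-validation and the toy model with the model under study.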

Privacy Preserving ML

  • Flower - Flower is a Federated Learning Framework with a unified approach. It enables the federation of any ML workload, with any ML framework, and any programming language.
  • Google's Differential Privacy - This is a C++ library of ε-differentially private algorithms, which can be used to produce aggregate statistics over numeric data sets containing private or sensitive information.
  • Intel Homomorphic Encryption Backend - The Intel HE transformer for nGraph is a Homomorphic Encryption (HE) backend to the Intel nGraph Compiler, Intel's graph compiler for Artificial Neural Networks.
  • Microsoft SEAL - Microsoft SEAL is an easy-to-use open-source (MIT licensed) homomorphic encryption library developed by the Cryptography Research group at Microsoft.
  • OpenFL - OpenFL is a Python framework for Federated Learning. OpenFL is designed to be a flexible, extensible and easily learnable tool for data scientists. OpenFL is developed by Intel Internet of Things Group (IOTG) and Intel Labs.
  • PySyft - A Python library for secure, private Deep Learning. PySyft decouples private data from model training, using Multi-Party Computation (MPC) within PyTorch.
  • Rosetta - A privacy-preserving framework based on TensorFlow with customized backend Operations using Multi-Party Computation (MPC). Rosetta reuses the TensorFlow APIs, so existing TensorFlow code can be made privacy-preserving with minimal changes.
  • Substra - Substra is an open-source framework for privacy-preserving, traceable and collaborative Machine Learning.
  • Tensorflow Privacy - A Python library that includes implementations of TensorFlow optimizers for training machine learning models with differential privacy.
  • TF Encrypted - A Framework for Confidential Machine Learning on Encrypted Data in TensorFlow.

Model and Data Versioning

  • Aim - A super-easy way to record, search and compare AI experiments.
  • Apache Marvin - A platform for model deployment and versioning that hides all complexity under the hood: data scientists just need to set up the server and write their code in an extended Jupyter notebook.
  • Catalyst - High-level utils for PyTorch DL & RL research. It was developed with a focus on reproducibility, fast experimentation and code/ideas reusing.
  • ClearML - Auto-Magical Experiment Manager & Version Control for AI (previously Trains).
  • D6tflow - A python library that allows for building complex data science workflows on Python.
  • Data Version Control (DVC) - A version control tool that works alongside Git to manage versions of data and models.
  • Deepkit - An open-source platform and cross-platform desktop application to execute, track, and debug modern machine learning experiments.
  • Dolt - Dolt is a SQL database that you can fork, clone, branch, merge, push and pull just like a git repository.
  • Flor - Easy to use logger and automatic version controller made for data scientists who write ML code.
  • Guild AI - Open source toolkit that automates and optimizes machine learning experiments.
  • Deeplake - Store, access & manage datasets with version-control for PyTorch/TensorFlow locally or on any cloud with scalable data pipelines.
  • Hangar - Version control for tensor data, git-like semantics on numerical data with high speed and efficiency.
  • Keepsake - Version control for machine learning.
  • lakeFS - Repeatable, atomic and versioned data lake on top of object storage.
  • MLflow - Open source platform to manage the ML lifecycle, including experimentation, reproducibility and deployment.
  • ModelDB - An open-source system to version machine learning models, including their ingredients (code, data, config, and environment), and to track ML metadata across the model lifecycle.
  • ModelStore - An open-source Python library that allows you to version, export, and save a machine learning model to your cloud storage provider.
  • ormb - Docker for Your ML/DL Models Based on OCI Artifacts.
  • Pachyderm - Open source distributed processing framework built on Kubernetes, focused mainly on dynamic building of production machine learning pipelines - (Video).
  • Polyaxon - A platform for reproducible and scalable machine learning and deep learning on kubernetes - (Video).
  • Quilt - Versioning, reproducibility and deployment of data and models.
  • Sacred - Tool to help you configure, organize, log and reproduce machine learning experiments.
  • Studio - Model management framework which minimizes the overhead involved with scheduling, running, monitoring and managing artifacts of your machine learning experiments.
  • TerminusDB - A graph database management system that stores data like git.

Model Training Orchestration

  • CML - Continuous Machine Learning (CML) is an open-source library for implementing continuous integration & delivery (CI/CD) in machine learning projects.
  • Determined - Deep learning training platform with integrated support for distributed training, hyperparameter tuning, and model management (supports Tensorflow and Pytorch).
  • envd - Machine learning development environment for data science and AI/ML engineering teams.
  • Flyte - Lyft's Cloud Native Machine Learning and Data Processing Platform - (Demo).
  • Hopsworks - Hopsworks is a data-intensive platform for the design and operation of machine learning pipelines that includes a Feature Store - (Video).
  • Kubeflow - A cloud native platform for machine learning based on Google's internal machine learning pipelines.
  • MLeap - Standardisation of pipeline and model serialization for Spark, Tensorflow and sklearn.
  • NVIDIA TensorRT - TensorRT is a C++ library for high performance inference on NVIDIA GPUs and deep learning accelerators.
  • Onepanel - Production scale vision AI platform, with fully integrated components for model building, automated labeling, data processing and model training pipelines.
  • Open Platform for AI - Platform that provides complete AI model training and resource management capabilities.
  • PyCaret - Low-code library for training and deploying models (scikit-learn, XGBoost, LightGBM, spaCy).
  • Skaffold - Skaffold is a command line tool that facilitates continuous development for Kubernetes applications. You can iterate on your application source code locally then deploy to local or remote Kubernetes clusters.
  • Tensorflow Extended (TFX) - Production oriented configuration framework for ML based on TensorFlow, incl. monitoring and model version management.
  • TonY - TonY is a framework to natively run deep learning jobs on Apache Hadoop. It currently supports TensorFlow, PyTorch, MXNet and Horovod.
  • ZenML - ZenML is an extensible, open-source MLOps framework to create reproducible ML pipelines with a focus on automated metadata tracking, caching, and many integrations to other tools.

Model Serving and Monitoring

  • Backprop - Backprop makes it simple to use, finetune, and deploy state-of-the-art ML models.
  • BentoML - BentoML is an open source framework for high performance ML model serving.
  • Cortex - Cortex is an open source platform for deploying machine learning models, trained with any framework, as production web services. No DevOps required.
  • Deepchecks - Deepchecks is an open source package for comprehensively validating your machine learning models and data with minimal effort during development, deployment or in production.
  • DeepDetect - Machine Learning production server for TensorFlow, XGBoost and Caffe models, written in C++ and maintained by Jolibrain.
  • Evidently - Evidently helps analyze machine learning models during development, validation, or production monitoring. The tool generates interactive reports from pandas DataFrame.
  • ForestFlow - Cloud-native machine learning model server.
  • Jina - Cloud native search framework that supports using deep learning and state-of-the-art AI models for search.
  • KFServing - Serverless framework to deploy and monitor machine learning models in Kubernetes - (Video).
  • m2cgen - A lightweight library that transpiles trained classic machine learning models into native code in C, Java, Go, R, PHP, Dart, Haskell, Rust and many other programming languages.
  • MLEM - Version and deploy your ML models following GitOps principles.
  • MLServer - An inference server for your machine learning models, including support for multiple frameworks, multi-model serving and more.
  • mltrace - a lightweight, open-source Python tool to get "bolt-on" observability in ML pipelines.
  • MLWatcher - MLWatcher is a Python agent that records a large variety of time-series metrics of your running ML classification algorithm. It enables you to monitor in real time.
  • Model Server for Apache MXNet (MMS) - A model server for Apache MXNet from Amazon Web Services that is able to run MXNet models as well as Gluon models (Amazon's SageMaker runs a custom version of MMS under the hood).
  • Mosec - A rust-powered and multi-stage pipelined model server which offers dynamic batching and more. Super easy to implement and deploy as micro-services.
  • OpenScoring - REST web service for the true real-time scoring (<1 ms) of Scikit-Learn, R and Apache Spark models.
  • Pandas Profiling - Creates HTML profiling reports from pandas DataFrame objects. It extends the pandas DataFrame with df.profile_report() for quick data analysis.
  • PredictionIO - An open source Machine Learning Server built on top of a state-of-the-art open source stack for developers and data scientists to create predictive engines for any machine learning task.
  • Redis-AI - A Redis module for serving tensors and executing deep learning models. Expect changes in the API and internals.
  • Seldon Core - Open source platform for deploying and monitoring machine learning models in kubernetes - (Video).
  • Tempo - Open source SDK that provides a unified interface to multiple MLOps projects that enable data scientists to deploy and productionise machine learning systems.
  • Tensorflow Serving - High-performance framework for serving TensorFlow models over gRPC, described as able to handle 100k requests per second per core.
  • TorchServe - TorchServe is a flexible and easy to use tool for serving PyTorch models.
  • Transformer-deploy - Transformer-deploy is an efficient, scalable and enterprise-grade CPU/GPU inference server for Hugging Face transformer models.
  • Triton Inference Server - Triton is a high performance open source serving software to deploy AI models from any framework on GPU & CPU while maximizing utilization.
  • WhyLogs - Lightweight solution for profiling and monitoring your ML data pipeline end-to-end

Adversarial Robustness

  • AdvBox - A toolbox to generate adversarial examples that fool neural networks in PaddlePaddle, PyTorch, Caffe2, MxNet, Keras, and TensorFlow. Advbox can also benchmark the robustness of machine learning models.
  • Adversarial DNN Playground - think TensorFlow Playground, but for Adversarial Examples! A visualization tool designed for learning and teaching - the attack library is limited in size, but it has a nice front-end to it with buttons you can press!
  • AdverTorch - library for adversarial attacks / defenses specifically for PyTorch.
  • Alibi Detect - alibi-detect is a Python package focused on outlier, adversarial and concept drift detection. The package aims to cover both online and offline detectors for tabular data, text, images and time series. The outlier detection methods should allow the user to identify global, contextual and collective outliers.
  • Artificial Adversary - AirBnB's library to generate text that reads the same to a human but passes adversarial classifiers.
  • CleverHans - library for testing adversarial attacks / defenses maintained by some of the most important names in adversarial ML, namely Ian Goodfellow (ex-Google Brain, now Apple) and Nicolas Papernot (Google Brain). Comes with some nice tutorials!
  • Counterfit - Counterfit is a command-line tool and generic automation layer for assessing the security of machine learning systems.
  • DEEPSEC - another systematic tool for attacking and defending deep learning models.
  • EvadeML - benchmarking and visualization tool for adversarial ML maintained by Weilin Xu, a PhD at University of Virginia, working with David Evans. Has a tutorial on re-implementation of one of the most important adversarial defense papers - feature squeezing (same team).
  • Foolbox - second biggest adversarial library. Has an even longer list of attacks - but no defenses or evaluation metrics. Geared more towards computer vision. Code easier to understand / modify than ART - also better for exploring blackbox attacks on surrogate models.
  • Adversarial Robustness Toolbox (ART) - ART provides tools that enable developers and researchers to defend and evaluate Machine Learning models and applications against the adversarial threats of Evasion, Poisoning, Extraction, and Inference.
  • MIA - A library for running membership inference attacks (MIA) against machine learning models.
  • Nicholas Carlini's Adversarial ML reading list - not a library, but a curated list of the most important adversarial papers by one of the leading minds in Adversarial ML, Nicholas Carlini. If you want to discover the 10 papers that matter the most - I would start here.
  • Robust ML - another robustness resource maintained by some of the leading names in adversarial ML. They specifically focus on defenses, and ones that have published code available next to papers. Practical and useful.
  • TextFool - plausible looking adversarial examples for text generation.
  • Trickster - Library and experiments for attacking machine learning in discrete domains using graph search.
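
For readers new to the topic, the sketch below shows what "an adversarial example that fools a model" can look like in the simplest case. It uses the fast gradient sign method (FGSM), chosen here purely as a well-known illustration and not taken from any specific library above, against a hand-written logistic regression. The weights, input, and epsilon are assumed values.

```python
import math

def predict_proba(weights, bias, x):
    """Probability of class 1 under a logistic regression model."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def fgsm(weights, bias, x, label, epsilon):
    """One-step fast gradient sign attack on the input x."""
    p = predict_proba(weights, bias, x)
    # Gradient of the cross-entropy loss with respect to the input is (p - y) * w.
    grad = [(p - label) * w for w in weights]
    # Move each feature by epsilon in the direction that increases the loss.
    return [xi + epsilon * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]

# Hypothetical model and input, correctly classified as class 1.
weights, bias = [2.0, -3.0], 0.0
x, label = [0.6, 0.2], 1

x_adv = fgsm(weights, bias, x, label, epsilon=0.2)
print("original  :", x, round(predict_proba(weights, bias, x), 3))
print("perturbed :", x_adv, round(predict_proba(weights, bias, x_adv), 3))
```

Each feature moves by at most epsilon, yet the predicted probability crosses the 0.5 decision threshold. The libraries in this section implement many stronger attacks, along with defenses and evaluation tooling for real models.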

Neural Architecture Search

Data Science Notebook

  • Apache Zeppelin - Web-based notebook that enables data-driven, interactive data analytics and collaborative documents with SQL, Scala and more.
  • Binder - Binder hosts notebooks in an executable environment (for free).
  • H2O Flow - Jupyter notebook-like interface for H2O to create, save and re-use "flows".
  • Jupyter Notebooks - Web interface python sandbox environments for reproducible development
  • ML Workspace - All-in-one web IDE for machine learning and data science. Combines Jupyter, VS Code, Tensorflow, and many other tools/libraries into one Docker image.
  • .NET Interactive - .NET Interactive takes the power of .NET and embeds it into your interactive experiences.
  • Papermill - Papermill is a library for parameterizing notebooks and executing them like Python scripts.
  • Ploomber - Ploomber allows you to develop workflows in Jupyter and execute them in a distributed environment without code changes. It supports Kubernetes, AWS Batch, and Airflow.
  • Polynote - Polynote is an experimental polyglot notebook environment. Currently, it supports Scala and Python (with or without Spark), SQL, and Vega.
  • RMarkdown - The rmarkdown package is a next generation implementation of R Markdown based on Pandoc.
  • Stencila - Stencila is a platform for creating, collaborating on, and sharing data driven content. Content that is transparent and reproducible.
  • Voilà - Voilà turns Jupyter notebooks into standalone web applications that can, for example, be used as dashboards.

Industrial Strength Visualisation

  • Altair - Altair is a declarative statistical visualization library for Python.
  • Apache ECharts - Apache ECharts is a powerful, interactive charting and data visualization library for browser.
  • Bokeh - Bokeh is an interactive visualization library for Python that enables beautiful and meaningful visual presentation of data in modern web browsers.
  • Geoplotlib - geoplotlib is a python toolbox for visualizing geographical data and making maps.
  • ggplot2 - An implementation of the grammar of graphics for R.
  • gradio - Quickly create and share demos of models - by only writing Python. Debug models interactively in your browser, get feedback from collaborators, and generate public links without deploying anything.
  • matplotlib - A Python 2D plotting library which produces publication-quality figures in a variety of hardcopy formats and interactive environments across platforms.
  • Missingno - missingno provides a small toolset of flexible and easy-to-use missing data visualizations and utilities that allows you to get a quick visual summary of the completeness (or lack thereof) of your dataset.
  • PDPBox - This repository is inspired by ICEbox. The goal is to visualize the impact of certain features towards model prediction for any supervised learning algorithm.
  • Perspective - Streaming pivot visualization via WebAssembly.
  • Pixiedust - PixieDust is a productivity tool for Python or Scala notebooks, which lets a developer encapsulate business logic into something easy for your customers to consume.
  • Plotly Dash - Dash is a Python framework for building analytical web applications without the need to write javascript.
  • Plotly.py - An interactive, open source, and browser-based graphing library for Python.
  • Plotly.NET - Plotly.NET provides functions for generating and rendering plotly.js charts in .NET programming languages.
  • PyCEbox - Python Individual Conditional Expectation Plot Toolbox.
  • pygal - pygal is a dynamic SVG charting library written in Python.
  • Redash - Redash is an open source visualisation framework that is built to allow easy access to big datasets leveraging multiple backends.
  • seaborn - Seaborn is a Python visualization library based on matplotlib. It provides a high-level interface for drawing attractive statistical graphics.
  • Streamlit - Streamlit lets you create apps for your machine learning projects with deceptively simple Python scripts. It supports hot-reloading, so your app updates live as you edit and save your file.
  • Superset - A modern, enterprise-ready business intelligence web application.
  • TensorBoard - A visualization toolkit for machine learning experimentation that makes it easy to host, track, and share ML experiments.
  • yellowbrick - yellowbrick provides matplotlib-based model evaluation plots for scikit-learn and other machine learning libraries.

Industrial Strength NLP

  • AdaptNLP - Built atop Zalando Research's Flair and Hugging Face's Transformers library, AdaptNLP provides Machine Learning Researchers and Scientists a modular and adaptive approach to a variety of NLP tasks with an Easy API for training, inference, and deploying NLP-based microservices.
  • Blackstone - Blackstone is a spaCy model and library for processing long-form, unstructured legal text. Blackstone is an experimental research project from the Incorporated Council of Law Reporting for England and Wales' research lab, ICLR&D.
  • CTRL - A Conditional Transformer Language Model for Controllable Generation released by SalesForce.
  • Facebook's XLM - PyTorch original implementation of Cross-lingual Language Model Pretraining which includes BERT, XLM, NMT, XNLI, PKM, etc.
  • Flair - Simple framework for state-of-the-art NLP developed by Zalando which builds directly on PyTorch.
  • Github's Semantic - Github's text library for parsing, analyzing, and comparing source code across many languages.
  • GluonNLP - GluonNLP is a toolkit that enables easy text preprocessing, datasets loading and neural models building to help you speed up your Natural Language Processing (NLP) research.
  • Grover - Grover is a model for Neural Fake News -- both generation and detection. However, it probably can also be used for other generation tasks.
  • Kashgari - Kashgari is a simple and powerful NLP transfer learning framework for building state-of-the-art models in 5 minutes for named entity recognition (NER), part-of-speech tagging (PoS), and text classification tasks.
  • OpenAI GPT-2 - OpenAI's code from their paper "Language Models are Unsupervised Multitask Learners".
  • sense2vec - A PyTorch library that allows for training and using sense2vec models, which are models that leverage the same approach as word2vec, but also leverage part-of-speech attributes for each token, which allows them to be "meaning-aware".
  • Snorkel - Snorkel is a system for quickly generating training data with weak supervision.
  • SpaCy - Industrial-strength natural language processing library built with python and cython by the explosion.ai team.
  • Stable Baselines - A fork of OpenAI Baselines, implementations of reinforcement learning algorithms.
  • Tensorflow Lingvo - A framework for building neural networks in Tensorflow, particularly sequence models.
  • Tensorflow Text - TensorFlow Text provides a collection of text related classes and ops ready to use with TensorFlow 2.0.
  • YouTokenToMe - YouTokenToMe is an unsupervised text tokenizer focused on computational efficiency. It currently implements fast Byte Pair Encoding (BPE).
  • Transformers - Huggingface's library of state-of-the-art pretrained models for Natural Language Processing (NLP).

Data Pipeline

  • Apache Airflow - Data Pipeline framework built in Python, including scheduler, DAG definition and a UI for visualisation.
  • Apache Nifi - Apache NiFi was made for dataflow. It supports highly configurable directed graphs of data routing, transformation, and system mediation logic.
  • Argo Workflows - Argo Workflows is an open source container-native workflow engine for orchestrating parallel jobs on Kubernetes. Argo Workflows is implemented as a Kubernetes CRD (Custom Resource Definition).
  • Azkaban - Azkaban is a batch workflow job scheduler created at LinkedIn to run Hadoop jobs. Azkaban resolves the ordering through job dependencies and provides an easy to use web user interface to maintain and track your workflows.
  • Basin - Visual programming editor for building Spark and PySpark pipelines.
  • Bonobo - ETL framework for Python 3.5+ with focus on simple atomic operations working concurrently on rows of data.
  • Chronos - More of a job scheduler for Mesos than ETL pipeline.
  • Couler - Unified interface for constructing and managing machine learning workflows on different workflow engines, such as Argo Workflows, Tekton Pipelines, and Apache Airflow.
  • Dagster - A data orchestrator for machine learning, analytics, and ETL.
  • DBT - ETL tool for running transformations inside data warehouses.
  • Flyte - Lyft's Cloud Native Machine Learning and Data Processing Platform - (Demo).
  • Genie - Job orchestration engine to interface and trigger the execution of jobs from Hadoop-based systems.
  • Gokart - Wrapper of the data pipeline Luigi.
  • Kedro - Kedro is a workflow development tool that helps you build data pipelines that are robust, scalable, deployable, reproducible and versioned. Visualization of the kedro workflows can be done by kedro-viz.
  • Luigi - Luigi is a Python module that helps you build complex pipelines of batch jobs, handling dependency resolution, workflow management, visualisation, etc.
  • Metaflow - A framework for data scientists to easily build and manage real-life data science projects.
  • Neuraxle - A framework for building neat pipelines, providing the right abstractions to chain your data transformation and prediction steps with data streaming, as well as doing hyperparameter searches (AutoML).
  • Oozie - Workflow scheduler for Hadoop jobs.
  • PipelineX - Based on Kedro and MLflow. Full comparison is found here.
  • Prefect Core - Workflow management system that makes it easy to take your data pipelines and add semantics like retries, logging, dynamic mapping, caching, failure notifications, and more.
  • SETL - A simple Spark-powered ETL framework that helps you structure your ETL projects, modularize your data transformation logic and speed up your development.
  • Snakemake - Workflow management system for reproducible and scalable data analyses.
  • Towhee - General-purpose machine learning pipeline for generating embedding vectors using one or many ML models.
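
Several of the tools above (Airflow, Luigi, Azkaban and others) describe pipelines as directed acyclic graphs and resolve execution order from task dependencies. A minimal sketch of that idea, using only Python's standard library (`graphlib`, Python 3.9+), is shown below. The task names are hypothetical and this is not the API of any listed tool.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract": set(),
    "clean": {"extract"},
    "features": {"clean"},
    "train": {"features"},
    "report": {"train", "clean"},
}


def execution_order(dag):
    """Return task names ordered so every task follows its dependencies."""
    # static_order() raises graphlib.CycleError if the graph contains a cycle.
    return list(TopologicalSorter(dag).static_order())


order = execution_order(pipeline)
print(order)
```

Real orchestrators layer scheduling, retries, logging and distributed execution on top of this ordering step.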

Data Labelling

  • brat rapid annotation tool - Web-based text annotation tool for Named Entity Recognition tasks.
  • COCO Annotator - Web-based image segmentation tool for object detection, localization and keypoints.
  • Computer Vision Annotation Tool (CVAT) - OpenCV's web-based annotation tool for both videos and images for computer vision algorithms.
  • Doccano - Open source text annotation tools for humans, providing functionality for sentiment analysis, named entity recognition, and machine translation.
  • ImageTagger - Image labelling tool with support for collaboration, supporting bounding box, polygon, line, point labelling, label export, etc.
  • ImgLab - Image annotation tool for bounding boxes with auto-suggestion and extensibility for plugins.
  • Label Studio - Multi-domain data labeling and annotation tool with standardized output format.
  • Labelimg - Open source graphical image annotation tool written in Python, using Qt for the graphical interface and focusing primarily on bounding boxes.
  • makesense.ai - Free to use online tool for labelling photos. Prepared labels can be downloaded in one of multiple supported formats.
  • MedTagger - A collaborative framework for annotating medical datasets using crowdsourcing.
  • OpenLabeling - Open source tool for labelling images with support for labels, edges, as well as image resizing and zooming in.
  • PixelAnnotationTool - Image annotation tool with ability to "colour" on the images to select labels for segmentation. Process is semi-automated with the marker-based watershed algorithm of OpenCV.
  • Rubrix - Open-source tool for tracking, exploring, and labeling data for AI projects.
  • Semantic Segmentation Editor - Hitachi's Open source tool for labelling camera and LIDAR data.
  • Superintendent - superintendent provides an ipywidget-based interactive labelling tool for your data.
  • VGG Image Annotator (VIA) - A simple and standalone manual annotation software for image, audio and video. VIA runs in a web browser and does not require any installation or setup.

Metadata Management

  • Amundsen - Amundsen is a metadata driven application for improving the productivity of data analysts, data scientists and engineers when interacting with data.
  • Apache Atlas - Apache Atlas framework is an extensible set of core foundational governance services, enabling enterprises to effectively and efficiently meet their compliance requirements within Hadoop and allowing integration with the whole enterprise data ecosystem.
  • DataHub - DataHub is LinkedIn's generalized metadata search & discovery tool.
  • Marquez - Marquez is an open source metadata service for the collection, aggregation, and visualization of a data ecosystem's metadata.
  • Metacat - Metacat is a unified metadata exploration API service. Metacat focusses on solving these three problems: 1) Federate views of metadata systems; 2) Allow arbitrary metadata storage about data sets; 3) Metadata discovery.
  • ML Metadata - A library for recording and retrieving metadata associated with ML developer and data scientist workflows.
  • Model Card Toolkit - Streamlines and automates generation of Model Cards.

Data Storage Optimisation

  • Alluxio - A virtual distributed storage system that bridges the gap between computation frameworks and storage systems.
  • Apache Arrow - In-memory columnar representation of data compatible with Pandas, Hadoop-based systems, etc.
  • Apache Druid - A high performance real-time analytics database. Check this article for an introduction.
  • Apache Ignite - A memory-centric distributed database, caching, and processing platform for transactional, analytical, and streaming workloads delivering in-memory speeds at petabyte scale - Demo.
  • Apache Parquet - On-disk columnar representation of data compatible with Pandas, Hadoop-based systems, etc.
  • Apache Pinot - A realtime distributed OLAP datastore. Comparison of the open source OLAP systems for big data: ClickHouse, Druid, and Pinot is found here.
  • BayesDB - A Bayesian database table for querying the probable implications of data as easily as SQL databases query the data itself. - (Video)
  • ClickHouse - ClickHouse is an open source column oriented database management system.
  • Delta Lake - Delta Lake is a storage layer that brings scalable, ACID transactions to Apache Spark and other big-data engines.
  • EdgeDB - NoSQL interface for Postgres that allows for object interaction to data stored.
  • HopsFS - HDFS-compatible file system with scale-out strongly consistent metadata.
  • InfluxDB - Scalable datastore for metrics, events, and real-time analytics.
  • Milvus - Milvus is a cloud-native, open-source vector database built to manage embedding vectors generated by machine learning models and neural networks.
  • Qdrant - An open source vector similarity search engine with extended filtering support.
  • TimescaleDB - An open-source time-series SQL database optimized for fast ingest and complex queries packaged as a PostgreSQL extension - (Video).
  • Weaviate - A low-latency vector search engine (GraphQL, RESTful) with out-of-the-box support for different media types. Modules include Semantic Search, Q&A, Classification, Customizable Models (PyTorch/TensorFlow/Keras), and more.
  • Zarr - Python implementation of chunked, compressed, N-dimensional arrays designed for use in parallel computing.

Function as a Service

  • Apache OpenWhisk - Open source, distributed serverless platform that executes functions in response to events at any scale.
  • Fission - (Early Alpha) Serverless functions as a service framework on Kubernetes.
  • Hydrosphere Mist - Serverless proxy for Apache Spark clusters.
  • Hydrosphere ML Lambda - Open source model management cluster for deploying, serving and monitoring machine learning models and ad-hoc algorithms with a FaaS architecture.
  • KNative Serving - Kubernetes based serverless microservices with "scale-to-zero" functionality.
  • Nuclio - A high-performance "serverless" framework focused on data, I/O, and compute intensive workloads. It is well integrated with popular data science tools, such as Jupyter and Kubeflow; supports a variety of data and streaming sources; and supports execution over CPUs and GPUs.
  • OpenFaaS - Serverless functions framework with RESTful API on Kubernetes.

Computation Load Distribution

  • Analytics Zoo - A unified Data Analytics and AI platform for distributed TensorFlow, Keras and PyTorch on Apache Spark/Flink & Ray.
  • Apache Spark MLlib - Apache Spark's scalable machine learning library in Java, Scala, Python and R.
  • Bagua - Bagua is a performant and flexible distributed training framework for PyTorch, providing a faster alternative to PyTorch DDP and Horovod. It supports advanced distributed training algorithms such as quantization and decentralization.
  • Beam - Apache Beam is a unified programming model for Batch and Streaming.
  • BigDL - Deep learning framework on top of Spark/Hadoop to distribute data and computations across an HDFS system.
  • Colossal-AI - A unified deep learning system for the big model era, which helps users to efficiently and quickly deploy large AI model training and inference.
  • Dask - Distributed parallel processing framework for Pandas and NumPy computations - (Video).
  • DEAP - A novel evolutionary computation framework for rapid prototyping and testing of ideas. It seeks to make algorithms explicit and data structures transparent. It works in perfect harmony with parallelisation mechanisms such as multiprocessing and SCOOP.
  • DeepSpeed - A deep learning optimization library (lightweight PyTorch wrapper) that makes distributed training easy, efficient, and effective.
  • Fiber - Distributed computing library for modern computer clusters from Uber.
  • Flashlight - A fast, flexible machine learning library written entirely in C++ from Facebook AI Research and the creators of Torch, TensorFlow, Eigen and Deep Speech.
  • Hivemind - Decentralized deep learning in PyTorch.
  • Horovod - Uber's distributed training framework for TensorFlow, Keras, and PyTorch.
  • NumPyWren - Scientific computing framework built on top of PyWren to enable NumPy-like distributed computations.
  • PyWren - Answers the question of the "cloud button" for Python function execution. It's a framework that abstracts AWS Lambda to enable data scientists to execute any Python function - (Video).
  • PyTorch Lightning - Lightweight PyTorch research framework that allows you to easily scale your models to GPUs and TPUs and use all the latest best practices, without the engineering boilerplate - (Video).
  • Ray - Ray is a flexible, high-performance distributed execution framework for machine learning - (Video).
  • TensorFlowOnSpark - TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.
  • Vespa - Vespa is an engine for low-latency computation over large data sets.

Model Serialisation

  • Java PMML API - Java libraries for consuming and producing PMML files containing models from different frameworks, including:
  • MMdnn - Cross-framework solution to convert, visualize and diagnose deep neural network models.
  • Neural Network Exchange Format (NNEF) - A standard format to store models across Torch, Caffe, TensorFlow, Theano, Chainer, Caffe2, PyTorch, and MXNet.
  • ONNX - Open Neural Network Exchange Format.
  • PFA - Created by the same organisation as PMML, the Portable Format for Analytics is an emerging standard for statistical models and data transformation engines.
  • PMML - The Predictive Model Markup Language standard in XML - (Video).

Optimized Computation

  • CuDF - Built based on the Apache Arrow columnar memory format, cuDF is a GPU DataFrame library for loading, joining, aggregating, filtering, and otherwise manipulating data.
  • CuML - cuML is a suite of libraries that implement machine learning algorithms and mathematical primitive functions that share compatible APIs with other RAPIDS projects.
  • CuPy - An implementation of NumPy-compatible multi-dimensional array on CUDA. CuPy consists of the core multi-dimensional array class, cupy.ndarray, and many functions on it.
  • H2O-3 - Fast scalable Machine Learning platform for smarter applications: Deep Learning, Gradient Boosting & XGBoost, Random Forest, Generalized Linear Modeling (Logistic Regression, Elastic Net), K-Means, PCA, Stacked Ensembles, Automatic Machine Learning (AutoML), etc.
  • Jax - Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more.
  • Modin - Speed up your Pandas workflows by changing a single line of code.
  • Nebullvm - Easy-to-use library to boost AI inference leveraging multiple deep learning compilers.
  • Numba - A compiler for Python array and numerical functions.
  • NumpyGroupies - Optimised tools for group-indexing operations: aggregated sum and more.
  • OpenVINO™ integration with TensorFlow - Highly optimized Neural Network inference with TensorFlow on Intel platforms by adding a single line of code.
  • Vaex - Vaex is a high performance Python library for lazy Out-of-Core DataFrames (similar to Pandas), to visualize and explore big tabular datasets. Vaex uses memory mapping, zero memory copy policy and lazy computations for best performance (no memory wasted).
  • Vulkan Kompute - Blazing fast, lightweight and mobile phone-enabled Vulkan compute framework optimized for advanced GPU data processing use cases.
  • Weld - High-performance runtime for data analytics applications. Here is an interview with Weld's main contributor.

Data Stream Processing

  • Apache Flink - Open source stream processing framework with powerful stream and batch processing capabilities.
  • Apache Samza - Distributed stream processing framework. It uses Apache Kafka for messaging, and Apache Hadoop YARN to provide fault tolerance, processor isolation, security, and resource management.
  • Brooklin - Distributed service for streaming data between heterogeneous source and destination systems at scale.
  • Faust - Streaming library built on top of Python's asyncio library, using an async Kafka client and inspired by the Kafka Streams library.
  • Apache Spark - Micro-batch processing for streams using the Apache Spark framework as a backend supporting stateful exactly-once semantics.
  • Apache Kafka - Kafka client library for building applications and microservices where the input and output are stored in Kafka clusters.

Outlier and Anomaly Detection

  • adtk - A Python toolkit for rule-based/unsupervised anomaly detection in time series.
  • Alibi-Detect - Algorithms for outlier and adversarial instance detection, concept drift and metrics.
  • dBoost - Outlier detection in heterogeneous datasets using automatic tuple expansion. Check this paper for further details.
  • Deequ - A library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.
  • Deep Anomaly Detection with Outlier Exposure - Outlier Exposure (OE) is a method for improving anomaly detection performance in deep learning models. Paper
  • PyOD - A Python Toolbox for Scalable Outlier Detection (Anomaly Detection).
  • SUOD (Scalable Unsupervised Outlier Detection) - An Acceleration System for Large-scale Anomaly/Outlier Detection.
  • TensorFlow Data Validation (TFDV) - Library for exploring and validating machine learning data.

Feature Engineering

  • auto-sklearn - Framework to automate algorithm and hyperparameter tuning for sklearn.
  • AutoGluon - Automated feature, model, and hyperparameter selection for tabular, image, and text data on top of popular machine learning libraries (Scikit-Learn, LightGBM, CatBoost, PyTorch, MXNet).
  • AutoML-GS - Automatic feature and model search with code generation in Python, on top of common data science libraries (tensorflow, sklearn, etc.).
  • automl - Automated feature engineering, feature/model selection, hyperparameter optimisation.
  • Colombus - A scalable framework to perform exploratory feature selection implemented in R.
  • Feature Engine - Feature-engine is a Python library that contains several transformers to engineer features for use in machine learning models.
  • Featuretools - An open source framework for automated feature engineering.
  • go-featureprocessing - A feature pre-processing framework in Go that matches functionality of sklearn.
  • keras-tuner - Keras Tuner is an easy-to-use, distributable hyperparameter optimization framework that solves the pain points of performing a hyperparameter search. Keras Tuner makes it easy to define a search space and leverage included algorithms to find the best hyperparameter values.
  • mljar-supervised - An Automated Machine Learning (AutoML) python package for tabular data. It can handle: Binary Classification, MultiClass Classification and Regression. It provides feature engineering, explanations and markdown reports.
  • sklearn-deap - Use evolutionary algorithms instead of gridsearch in scikit-learn.
  • TPOT - Automation of sklearn pipeline creation (including feature selection, pre-processor, etc.).
  • tsfresh - Automatic extraction of relevant features from time series.
  • Upgini - Free automated data & feature enrichment library for machine learning: automatically searches through thousands of ready-to-use features from public and community shared data sources and enriches your training dataset with only the accuracy-improving features.

Feature Store

  • Butterfree - A tool for building feature stores which allows you to transform your raw data into beautiful features.
  • Feature Store for Machine Learning (FEAST) - Feast (Feature Store) is a tool for managing and serving machine learning features. Feast is the bridge between models and data.
  • Featureform - A virtual featurestore. Plug-&-play with your existing infra. Data Scientist approved. Discovery, Governance, Lineage, & Collaboration just a pip install away. Supports pandas, Python, spark, SQL + integrations with major cloud vendors.
  • Hopsworks Feature Store - Offline/Online Feature Store for ML (Video).
  • Ivory - Ivory defines a specification for how to store feature data and provides a set of tools for querying it. It does not provide any tooling for producing feature data in the first place. All Ivory commands run as MapReduce jobs, so it is assumed that feature data is maintained on HDFS.
  • Veri - Veri is a Feature Label Store. Feature Label store allows storing features as keys and labels as values. Querying values is only possible with knn using features. Veri also supports creating sub sample spaces of data by default.

Commercial Platform

  • Amazon SageMaker - End-to-end machine learning development and deployment interface where you are able to build notebooks that use EC2 instances as backend, and then can host models exposed on an API.
  • Apheris - A platform for federated and privacy-preserving data science that lets you securely collaborate on AI with partners without sharing any data.
  • Arize AI - ML observability and automated model monitoring to help ML practitioners understand how their models perform in production, troubleshoot issues, and improve model performance. ML teams can upload offline (training or validation) baselines into an evaluation/inference store alongside online production data for model validation, drift detection, data quality checks, and model performance management.
  • BigML - A consumable, programmable, and scalable Machine Learning platform that makes it easy to solve and automate classification, regression, time series, etc..
  • Censius - Censius is an AI Observability Platform that assists enterprises in continuously monitoring, analyzing, and explaining their production models. It combines monitoring, accountability, and explainability into one Observability Platform.
  • Cnvrg.io - An end-to-end platform to manage, build and automate machine learning.
  • Comet - Machine learning experiment management. Free for open source and students - (Video).
  • D2iQ Kaptain - An end-to-end machine learning platform built for security, scale, and speed, that allows enterprises to develop and deploy machine learning models that run in the cloud, on premises (incl. air-gapped), in hybrid environments, or on the edge; based on Kubeflow and open-source Kubernetes Universal Declarative Operators (KUDO).
  • DAGsHub - Community platform for Open Source ML: manage experiments, data & models and create collaborative ML projects easily.
  • Databricks - An integrated end-to-end machine learning environment incorporating managed services for experiment tracking, model training, feature development and management, and feature and model serving.
  • Dataiku - Collaborative data science platform powering both self-service analytics and the operationalization of machine learning models in production.
  • DataRobot - Automated machine learning platform which enables users to build and deploy machine learning models.
  • Datatron - Machine Learning Model Governance Platform for all your AI models in production for large Enterprises.
  • Deep Cognition Deep Learning Studio - E2E platform for deep learning.
  • deepsense Safety - AI-driven solution to increase worksite safety via safety procedure checks, threat detection and hazardous zone monitoring.
  • deepsense Quality - Automating laborious quality control tasks.
  • Diffgram - Training Data First platform. Database & Training Data Pipelines for Supervised AI. Integrated with GCP, AWS, Azure and top Annotation Supervision UIs (or use built-in Diffgram UI, or build your own). Plus a growing list of integrated service providers! For Computer Vision, NLP, and Supervised Deep Learning / Machine Learning.
  • Domino - An enterprise MLOps platform that supports data scientist collaboration with their preferred tools, languages, and infrastructure, with IT central resource management, governance, and security, without vendor lock-in.
  • Google Cloud Machine Learning Engine - Managed service that enables developers and data scientists to build and bring machine learning models to production.
  • Graphsignal - Machine learning profiler that helps make model training and inference faster and more efficient.
  • H2O Driverless AI - Automates key machine learning tasks, delivering automatic feature engineering, model validation, model tuning, model selection and deployment, machine learning interpretability, bring your own recipe, time-series and automatic pipeline generation for model scoring - (Video).
  • IBM Watson Studio - Build and scale trusted AI on any cloud. Automate the AI lifecycle for ModelOps.
  • Iguazio Data Science Platform - Bring your Data Science to life by automating MLOps with end-to-end machine learning pipelines, transforming AI projects into real-world business outcomes, and supporting real-time performance at enterprise scale.
  • Iterative Studio - Seamless data and model management, experiment tracking, visualization and automation, with Git as the single source of truth.
  • Katonic.ai - Automate your cycle of Intelligence with Katonic MLOps Platform.
  • Labelbox - Image labelling service with support for semantic segmentation (brush & superpixels), bounding boxes and nested classifications.
  • Microsoft Azure Machine Learning service - Build, train, and deploy models from the cloud to the edge.
  • ModelOp - An enterprise MLOps platform that automates the governance, management and monitoring of deployed AI and ML models across platforms and teams, resulting in reliable, compliant and scalable AI initiatives.
  • MLJAR - Platform for rapid prototyping, developing and deploying machine learning models.
  • Neptune.ai - Community-friendly platform supporting data scientists in creating and sharing machine learning models. Neptune facilitates teamwork, infrastructure management, model comparison and reproducibility.
  • Nimblebox - A full-stack MLOps platform designed to help data scientists and machine learning practitioners around the world discover, create, and launch multi-cloud apps from their web browser.
  • Prodigy - Active learning-based data annotation. Lets you train a model and pick the most 'uncertain' samples for labeling from an unlabeled pool.
  • Robust Intelligence - Robust Intelligence is an end-to-end ML integrity solution that proactively eliminates failure at every stage of the model lifecycle. From pre-deployment vulnerability detection and validation to post-deployment monitoring and protection, Robust Intelligence gives teams the confidence to scale models in production across a variety of use cases and modalities.
  • Scribble Enrich - Customizable, auditable, privacy-aware feature store. It is designed to help mid-sized data teams gain trust in the data that they use for training and analysis, and support emerging needs such as drift computation and bias assessment.
  • Skymind - Software distribution designed to help enterprise IT teams manage, deploy, and retrain machine learning models at scale.
  • Skytree - End to end machine learning platform - (Video).
  • Spell - Flexible end-to-end MLOps / Machine Learning Platform - (Video).
  • SuperAnnotate - A complete set of solutions for image and video annotation and an annotation service with integrated tooling, on-demand narrow expertise in various fields, and a custom neural network, automation, and training models powered by AI.
  • Superb AI - ML DataOps platform providing various tools to build, label, manage and iterate on training data.
  • Syndicai - Easy-to-use cloud agnostic platform that deploys, manages, and scales any trained AI model in minutes with no configuration & infrastructure setup.
  • Talend Studio - Data integration platform that provides various software and services for data integration, data management, enterprise application integration, data quality, cloud storage and Big Data.
  • Valohai - Machine orchestration, version control and pipeline management for deep learning.
  • Weights & Biases - Machine learning experiment tracking, dataset versioning, hyperparameter search, visualization, and collaboration.