Model Serving Made Easy


BentoML

BentoML is an open-source framework for ML model serving, bridging the gap between Data Science and DevOps.

What does BentoML do?

  • Package models trained with any framework and reproduce them for model serving in production
  • Package once and deploy anywhere, supporting Docker, Kubernetes, Apache Spark, Airflow, Kubeflow, Knative, AWS Lambda, SageMaker, Azure ML, GCP, Heroku and more
  • High-Performance API model server with adaptive micro-batching support
  • Central hub for teams to manage and access packaged models via Web UI and API

👉 To connect with the community and ask questions, check out BentoML Discussions on GitHub and the BentoML Slack Community.


Why BentoML

Moving trained Machine Learning models to serving applications in production is hard. Data Scientists are not experts in building production services. The trained models they produce are loosely specified and hard to deploy. This often leads ML teams into a time-consuming and error-prone process, where a Jupyter notebook, along with pickle and protobuf files, is handed over to ML engineers to be turned into services that can be properly deployed and managed by DevOps.

BentoML is a framework for ML model serving. It provides high-level APIs for Data Scientists to create production-ready prediction services without having to worry about infrastructure needs and performance optimizations. BentoML handles all of that under the hood, allowing DevOps to work seamlessly with the Data Science team to deploy and operate models packaged in the BentoML format.

Check out the Frequently Asked Questions page on how BentoML compares to TensorFlow Serving, Clipper, AWS SageMaker, MLFlow, etc.

BentoML Feature Highlights

Online serving with API model server:

  • Containerized model server for production deployment with Docker, Kubernetes, OpenShift, AWS ECS, Azure, GCP GKE, etc
  • Adaptive micro-batching for optimal online serving performance
  • Discover and package all dependencies automatically, including PyPI packages, conda packages and local Python modules
  • Support multiple ML frameworks including PyTorch, Tensorflow, Scikit-Learn, XGBoost, and many more
  • Serve compositions of multiple models
  • Serve multiple endpoints in one model server
  • Serve any Python code along with trained models
  • Automatically generate HTTP API spec in Swagger/OpenAPI format
  • Prediction logging and feedback logging endpoint
  • Health check endpoint and Prometheus /metrics endpoint for monitoring
  • Model serving via gRPC endpoint (roadmap)

Advanced workflow for model serving and deployment:

  • Central repository for managing all your team's packaged models via Web UI and API
  • Launch inference runs from the CLI or Python, enabling CI/CD testing, programmatic access and offline batch inference jobs
  • Distributed batch or streaming jobs with Apache Spark (requires manual setup; better support is on the roadmap)
  • Automated deployment with cloud platforms including AWS Lambda, AWS SageMaker, and Azure Functions
  • Advanced model deployment workflow on Kubernetes cluster, including auto-scaling, scale-to-zero, A/B testing, canary deployment, and multi-armed-bandit (roadmap)
  • Deep integration with ML experimentation platforms including MLFlow, Kubeflow (roadmap)

Getting Started

Run this Getting Started guide on Google Colab.

BentoML requires Python 3.6 or above. Install it with pip:

pip install bentoml

Before starting, let's prepare a trained model for serving with BentoML.

Install required dependencies to run the example code:

pip install scikit-learn pandas

Train a classifier model on the Iris data set:

from sklearn import svm
from sklearn import datasets

# Load training data
iris = datasets.load_iris()
X, y = iris.data, iris.target

# Model Training
clf = svm.SVC(gamma='scale')
clf.fit(X, y)

Here's what a minimal prediction service in BentoML looks like:

import pandas as pd

from bentoml import env, artifacts, api, BentoService
from bentoml.adapters import DataframeInput
from bentoml.artifact import SklearnModelArtifact

@env(auto_pip_dependencies=True)
@artifacts([SklearnModelArtifact('model')])
class IrisClassifier(BentoService):

    @api(input=DataframeInput())
    def predict(self, df: pd.DataFrame):
        # Optional pre-processing, post-processing code goes here
        return self.artifacts.model.predict(df)

This code defines a prediction service that packages a scikit-learn model and provides an inference API that expects a pandas.DataFrame object as its input. BentoML also supports other API input data types including JsonInput, ImageInput, FileInput and more.

In BentoML, all inference APIs are expected to accept a list of inputs and return a list of results. In the case of DataframeInput, each row of the DataFrame maps to one prediction request received from the client. BentoML converts HTTP JSON requests into a pandas.DataFrame object before passing it to the user-defined inference API function.

This design allows BentoML to group API requests into small batches while serving online traffic. Compared to a regular Flask or FastAPI based model server, this can increase the overall throughput of the API server by 10-100x depending on the workload.
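Other input adapters follow the same list-in/list-out convention. As a rough sketch only (the IrisJsonClassifier class name, the list-of-parsed-JSON function signature, and the "features" field are assumptions for illustration, not part of this guide), a JSON-based variant of the service above could look like:

import pandas as pd

from bentoml import env, artifacts, api, BentoService
from bentoml.adapters import JsonInput
from bentoml.artifact import SklearnModelArtifact

@env(auto_pip_dependencies=True)
@artifacts([SklearnModelArtifact('model')])
class IrisJsonClassifier(BentoService):  # hypothetical example service

    @api(input=JsonInput())
    def predict(self, parsed_json_list):
        # Assumes each element is one request body, e.g. {"features": [5.1, 3.5, 1.4, 0.2]}
        df = pd.DataFrame([row['features'] for row in parsed_json_list])
        # Return one result per input request
        return self.artifacts.model.predict(df).tolist()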

The following code packages the trained model with the prediction service class IrisClassifier defined above, and then saves the IrisClassifier instance to disk in the BentoML format for distribution and deployment:

# import the IrisClassifier class defined above
from iris_classifier import IrisClassifier

# Create an iris classifier service instance
iris_classifier_service = IrisClassifier()

# Pack the newly trained model artifact
iris_classifier_service.pack('model', clf)

# Save the prediction service to disk for model serving
saved_path = iris_classifier_service.save()

BentoML stores all packaged model files under the ~/bentoml/{service_name}/{service_version} directory by default. The BentoML file format contains all the code, files, and configs required to deploy the model for serving.
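Because the saved bundle is self-contained, it can also be loaded back into a Python session for quick testing or offline scoring. A minimal sketch, assuming the bentoml.load API from BentoML 0.x and the saved_path returned by save() above:

import bentoml
import pandas as pd

# Load the saved BentoService bundle from disk (assumes bentoml.load is available)
svc = bentoml.load(saved_path)

# Run a local prediction without starting an API server
print(svc.predict(pd.DataFrame([[5.1, 3.5, 1.4, 0.2]])))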

To start a REST API model server with the IrisClassifier saved above, use the bentoml serve command:

bentoml serve IrisClassifier:latest

The IrisClassifier model is now served at localhost:5000. Use the curl command to send a prediction request:

$ curl -i \
  --header "Content-Type: application/json" \
  --request POST \
  --data '[[5.1, 3.5, 1.4, 0.2]]' \
  http://localhost:5000/predict

Or with Python and the requests library:

import requests
response = requests.post("http://127.0.0.1:5000/predict", json=[[5.1, 3.5, 1.4, 0.2]])
print(response.text)

Note that the BentoML API server automatically converts the DataFrame JSON format into a pandas.DataFrame object before passing it to the user-defined inference API function.
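Since each row of the DataFrame maps to one prediction, multiple rows can also be sent in a single request and the server returns one result per row. A quick sanity check with the requests library (the printed labels are illustrative; the exact output depends on the trained model):

import requests

# Two Iris samples in one request; expect one prediction per row in the response
response = requests.post(
    "http://127.0.0.1:5000/predict",
    json=[[5.1, 3.5, 1.4, 0.2], [6.2, 3.4, 5.4, 2.3]],
)
print(response.json())  # e.g. [0, 2]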

The BentoML API server also provides a simple web UI dashboard. Go to http://localhost:5000 in the browser and use the Web UI to send a prediction request.

One common way of distributing this model API server for production deployment is via Docker containers, and BentoML provides a convenient way to do that.

If you already have Docker configured, run the following command to build a Docker container image for serving the IrisClassifier prediction service created above:

$ bentoml containerize IrisClassifier:latest -t iris-classifier

Start a container with the docker image built from the previous step:

$ docker run -p 5000:5000 iris-classifier --enable-microbatch --workers=1

Continue reading the getting started guide here.

Documentation

BentoML full documentation: https://docs.bentoml.org/

Frameworks

BentoML supports major ML frameworks out-of-the-box, including PyTorch, Tensorflow, Scikit-Learn, XGBoost and many more; see the full list in the documentation.

Examples Gallery

Visit bentoml/gallery repository for list of example ML projects built with BentoML.

Deployment guides for platforms such as AWS Lambda, AWS SageMaker, Azure Functions and Kubernetes are available in the documentation.

Contributing

Have questions or feedback? Post a new GitHub issue or discuss in the BentoML Slack channel.

Want to help build BentoML? Check out our contributing guide and the development guide.

Releases

BentoML is under active development and evolving rapidly. It is currently a beta release; APIs may change in future releases.

Read more about the latest features and changes in BentoML from the releases page.

Usage Tracking

BentoML by default collects anonymous usage data using Amplitude. It only collects the BentoML library's own actions and parameters; no user or model data is collected. Here is the code that does it.

This helps BentoML team to understand how the community is using this tool and what to build next. You can easily opt-out of usage tracking by running the following command:

# From terminal:
bentoml config set usage_tracking=false
# From python:
import bentoml
bentoml.config().set('core', 'usage_tracking', 'False')

License

Apache License 2.0
