BentoML is an open-source framework for ML model serving, bridging the gap between Data Science and DevOps.
What does BentoML do?
- Package models trained with any framework and reproduce them for model serving in production
- Package once and deploy anywhere, supporting Docker, Kubernetes, Apache Spark, Airflow, Kubeflow, Knative, AWS Lambda, SageMaker, Azure ML, GCP, Heroku and more
- High-Performance API model server with adaptive micro-batching support
- Central hub for teams to manage and access packaged models via Web UI and API
👉 To connect with the community and ask questions, check out BentoML Discussions on GitHub and the BentoML Slack Community.
Moving trained Machine Learning models to serving applications in production is hard. Data Scientists are not experts in building production services, and the trained models they produce are loosely specified and hard to deploy. This often leads ML teams into a time-consuming and error-prone process, where a Jupyter notebook along with pickle and protobuf files is handed over to ML engineers to turn the trained model into services that can be properly deployed and managed by DevOps.
BentoML is a framework for ML model serving. It provides high-level APIs for Data Scientists to create production-ready prediction services, without worrying about infrastructure needs and performance optimizations. BentoML handles all of that under the hood, allowing DevOps to work seamlessly with the Data Science team to deploy and operate their models, packaged in the BentoML format.
Check out the Frequently Asked Questions page for how BentoML compares to TensorFlow Serving, Clipper, AWS SageMaker, MLFlow, etc.
Online serving with API model server:
- Containerized model server for production deployment with Docker, Kubernetes, OpenShift, AWS ECS, Azure, GCP GKE, etc
- Adaptive micro-batching for optimal online serving performance
- Discover and package all dependencies automatically, including PyPI packages, conda packages, and local Python modules
- Support multiple ML frameworks including PyTorch, Tensorflow, Scikit-Learn, XGBoost, and many more
- Serve compositions of multiple models
- Serve multiple endpoints in one model server (see the sketch below this list)
- Serve any Python code along with trained models
- Automatically generate HTTP API spec in Swagger/OpenAPI format
- Prediction logging and feedback logging endpoint
- Health check endpoint and Prometheus `/metrics` endpoint for monitoring
- Model serving via gRPC endpoint (roadmap)
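As a rough sketch of how multiple endpoints can live in one model server (based on the 0.x-style API shown in the quickstart below, not an excerpt from the official docs; the service and method names here are illustrative):
import pandas as pd
from bentoml import env, artifacts, api, BentoService
from bentoml.adapters import DataframeInput, JsonInput
from bentoml.artifact import SklearnModelArtifact
@env(auto_pip_dependencies=True)
@artifacts([SklearnModelArtifact('model')])
class MultiEndpointService(BentoService):
    @api(input=DataframeInput())
    def predict(self, df):
        # Tabular prediction endpoint
        return self.artifacts.model.predict(df)
    @api(input=JsonInput())
    def predict_json(self, parsed_json):
        # Arbitrary Python code can run alongside the model, e.g. reshaping a
        # JSON payload into a DataFrame first (JsonInput parsing details may
        # vary across BentoML versions)
        df = pd.DataFrame(parsed_json)
        return self.artifacts.model.predict(df)
Each @api method becomes its own HTTP endpoint when the service is served.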
Advanced workflow for model serving and deployment:
- Central repository for managing all your team's packaged models via Web UI and API
- Launch inference runs from the CLI or Python, enabling CI/CD testing, programmatic access, and offline batch inference jobs
- Distributed batch or streaming jobs with Apache Spark (requires manual setup; better support for this is on the roadmap)
- Automated deployment with cloud platforms including AWS Lambda, AWS SageMaker, and Azure Functions
- Advanced model deployment workflow on Kubernetes cluster, including auto-scaling, scale-to-zero, A/B testing, canary deployment, and multi-armed-bandit (roadmap)
- Deep integration with ML experimentation platforms including MLFlow, Kubeflow (roadmap)
Run this Getting Started guide on Google Colab:
BentoML requires Python 3.6 or above. Install it with `pip`:
pip install bentoml
Before starting, let's prepare a trained model for serving with BentoML.
Install required dependencies to run the example code:
pip install scikit-learn pandas
Train a classifier model on the Iris data set:
from sklearn import svm
from sklearn import datasets
# Load training data
iris = datasets.load_iris()
X, y = iris.data, iris.target
# Model Training
clf = svm.SVC(gamma='scale')
clf.fit(X, y)
Here's what a minimal prediction service in BentoML looks like:
import pandas as pd
from bentoml import env, artifacts, api, BentoService
from bentoml.adapters import DataframeInput
from bentoml.artifact import SklearnModelArtifact
@env(auto_pip_dependencies=True)
@artifacts([SklearnModelArtifact('model')])
class IrisClassifier(BentoService):
    @api(input=DataframeInput())
    def predict(self, df: pd.DataFrame):
        # Optional pre-processing, post-processing code goes here
        return self.artifacts.model.predict(df)
This code defines a prediction service that packages a scikit-learn model and provides an inference API that expects a `pandas.DataFrame` object as its input. BentoML also supports other API input data types, including `JsonInput`, `ImageInput`, `FileInput`, and more.
In BentoML, all inference APIs are supposed to accept a list of inputs and return a list of results. In the case of `DataframeInput`, each row of the DataFrame maps to one prediction request received from the client. BentoML will convert HTTP JSON requests into a `pandas.DataFrame` object before passing it to the user-defined inference API function.
This design allows BentoML to group API requests into small batches while serving online traffic. Compared to a regular Flask or FastAPI based model server, this can increase the overall throughput of the API server by 10-100x depending on the workload.
The following code packages the trained model with the prediction service class `IrisClassifier` defined above, and then saves the `IrisClassifier` instance to disk in the BentoML format for distribution and deployment:
# import the IrisClassifier class defined above
from iris_classifier import IrisClassifier
# Create an iris classifier service instance
iris_classifier_service = IrisClassifier()
# Pack the newly trained model artifact
iris_classifier_service.pack('model', clf)
# Save the prediction service to disk for model serving
saved_path = iris_classifier_service.save()
BentoML stores all packaged model files under the `~/bentoml/{service_name}/{service_version}` directory by default. The BentoML file format contains all the code, files, and configs required to deploy the model for serving.
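For example, the saved bundle can be loaded back into Python for testing or offline batch scoring (a minimal sketch, assuming the `bentoml.load` helper from the 0.x Python API and the `saved_path` returned by the save() call above):
import bentoml
import pandas as pd
# Restore the saved BentoService bundle from disk
svc = bentoml.load(saved_path)
# Run predictions with the restored service, outside of the API server
batch = pd.DataFrame([[5.1, 3.5, 1.4, 0.2],
                      [6.3, 3.3, 6.0, 2.5]])
print(svc.predict(batch))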
To start a REST API model server with the `IrisClassifier` saved above, use the `bentoml serve` command:
bentoml serve IrisClassifier:latest
The `IrisClassifier` model is now served at `localhost:5000`. Use the `curl` command to send a prediction request:
$ curl -i \
--header "Content-Type: application/json" \
--request POST \
--data '[[5.1, 3.5, 1.4, 0.2]]' \
http://localhost:5000/predict
Or with Python and the `requests` library:
import requests
response = requests.post("http://127.0.0.1:5000/predict", json=[[5.1, 3.5, 1.4, 0.2]])
print(response.text)
Note that the BentoML API server automatically converts the DataFrame JSON format into a `pandas.DataFrame` object before sending it to the user-defined inference API function.
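For example, a JSON payload with multiple rows becomes a multi-row DataFrame, and the response contains one prediction per row (same endpoint as above; the exact response formatting may vary by BentoML version):
import requests
# Two rows in the payload -> a two-row DataFrame -> two predictions
payload = [
    [5.1, 3.5, 1.4, 0.2],
    [6.3, 3.3, 6.0, 2.5],
]
response = requests.post("http://127.0.0.1:5000/predict", json=payload)
print(response.text)  # a JSON list with one predicted class per input row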
The BentoML API server also provides a simple web UI dashboard. Go to http://localhost:5000 in the browser and use the web UI to send a prediction request:
One common way of distributing this model API server for production deployment is via Docker containers, and BentoML provides a convenient way to do that.
If you already have Docker configured, run the following command to build a Docker container image for serving the `IrisClassifier` prediction service created above:
$ bentoml containerize IrisClassifier:latest -t iris-classifier
Start a container with the docker image built from the previous step:
$ docker run -p 5000:5000 iris-classifier --enable-microbatch --workers=1
Continue reading the getting started guide here.
BentoML full documentation: https://docs.bentoml.org/
- Quick Start Guide: https://docs.bentoml.org/en/latest/quickstart.html
- Core Concepts: https://docs.bentoml.org/en/latest/concepts.html
- Deployment Guides: https://docs.bentoml.org/en/latest/deployment/index.html
- API References: https://docs.bentoml.org/en/latest/api/index.html
- Frequently Asked Questions: https://docs.bentoml.org/en/latest/faq.html
BentoML supports these ML frameworks out-of-the-box:
- Scikit-Learn - Docs | Examples
- PyTorch - Docs | Examples
- Tensorflow 2 - Docs | Examples
- Tensorflow Keras - Docs | Examples
- XGBoost - Docs | Examples
- LightGBM - Docs | Examples
- FastText - Docs | Examples
- FastAI - Docs | Examples
- H2O - Docs | Examples
- ONNX - Docs | Examples
- CoreML - Docs
- Spacy - Docs
Visit the bentoml/gallery repository for a list of example ML projects built with BentoML.
- Automated deployment with BentoML
- Deploy with open-source platforms
- Deploy with cloud services
Have questions or feedback? Post a new GitHub issue or discuss in our Slack channel:
Want to help build BentoML? Check out our contributing guide and the development guide.
BentoML is under active development and is evolving rapidly. It is currently a beta release, and we may change APIs in future releases.
Read more about the latest features and changes in BentoML from the releases page.
BentoML by default collects anonymous usage data using Amplitude. It only collects the BentoML library's own actions and parameters; no user or model data will be collected. Here is the code that does it.
This helps the BentoML team understand how the community is using this tool and what to build next. You can easily opt out of usage tracking by running the following command:
# From terminal:
bentoml config set usage_tracking=false
# From python:
import bentoml
bentoml.config().set('core', 'usage_tracking', 'False')