This project is a sentence embedding API service built with BentoML. With one command, you can launch a high-performance REST API server for generating text embeddings. It comes with all-MiniLM-L6-v2 as the default embedding model, but you can easily customize it to use other embedding models.
Looking for Image Embeddings? Check out CLIP-API-service.
To quickly get started, follow the instructions below or try this tutorial in Google Colab: Sentence Embedding with BentoML
The pre-built Docker images for this project can be found on the GitHub Container Registry here.
First, ensure you have Docker installed and running.
Launch the embedding service locally with the following command:
docker run --rm -p 3000:3000 ghcr.io/bentoml/sentence-embedding-bento:latest
Open http://0.0.0.0:3000 in your browser to send test requests from the Web UI.
Alternatively, generate text embeddings with the BentoML Python API client or a cURL command:
from bentoml.client import Client

# Connect to the embedding service started above
client = Client.from_url("http://localhost:3000")

samples = [
    "The dinner was great!",
    "The weather is great today!",
    "I love fried chicken sandwiches!",
]

# Returns one embedding vector per input sentence
print(client.encode(samples))
curl -X POST http://localhost:3000/encode \
-H 'Content-Type: application/json' \
-d '["hello world, how are you?", "I love fried chicken sandwiches!"]'
To run model inference on a GPU, install the NVIDIA Container Toolkit and use the GPU-enabled Docker image instead (note the --gpus flag):

docker run --gpus all --rm -p 3000:3000 ghcr.io/bentoml/sentence-embedding-bento-gpu:latest
This repository is meant to be hackable and educational for building your own text embedding service with BentoML. Get started by forking and cloning this repository:
git clone https://github.com/bentoml/sentence-embedding-bento.git
cd sentence-embedding-bento
You will need Python 3.8 or above to run this example.
Install the dependencies via pip:
pip install -U -r ./requirements.txt
python import_model.py
This saves and versions the all-MiniLM-L6-v2 model in your local BentoML model store.
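Conceptually, the import script just downloads the model from the Hugging Face Hub and saves it into the BentoML model store. A minimal sketch of that pattern, using a transformers feature-extraction pipeline and a hypothetical tag name (the repo's import_model.py may structure this differently):

import bentoml
import transformers

# A feature-extraction pipeline bundles the model and tokenizer together
pipe = transformers.pipeline(
    "feature-extraction",
    model="sentence-transformers/all-MiniLM-L6-v2",
)

# Save the pipeline to the local BentoML model store; BentoML versions it
# automatically with a generated tag
bentoml.transformers.save_model("all-minilm-l6-v2", pipe)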
Start the embedding service:
bentoml serve
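The service is defined in service.py on top of a custom runner (embedding_runnable.py). Conceptually, each encode request tokenizes the sentences, runs the transformer, and mean-pools the token embeddings using the attention mask. A simplified, standalone sketch of that pooling logic (loading the model straight from the Hub instead of the BentoML model store):

import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

def encode(sentences):
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool token embeddings, using the attention mask to ignore padding
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    summed = (outputs.last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1e-9)
    return summed / counts  # one 384-dimensional vector per sentence

print(encode(["hello world"]).shape)  # torch.Size([1, 384])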
A Bento is the standardized distribution format supported by an array of downstream deployment tools in the BentoML ecosystem. It captures your service code, models, and configurations in one place, versions them automatically, and ensures reproducibility across your development and production environments.
> bentoml build
██████╗ ███████╗███╗ ██╗████████╗ ██████╗ ███╗ ███╗██╗
██╔══██╗██╔════╝████╗ ██║╚══██╔══╝██╔═══██╗████╗ ████║██║
██████╔╝█████╗ ██╔██╗ ██║ ██║ ██║ ██║██╔████╔██║██║
██╔══██╗██╔══╝ ██║╚██╗██║ ██║ ██║ ██║██║╚██╔╝██║██║
██████╔╝███████╗██║ ╚████║ ██║ ╚██████╔╝██║ ╚═╝ ██║███████╗
╚═════╝ ╚══════╝╚═╝ ╚═══╝ ╚═╝ ╚═════╝ ╚═╝ ╚═╝╚══════╝
Successfully built Bento(tag="sentence-embedding-svc:scyvqxrxlc4rduqj").
Possible next steps:
* Containerize your Bento with `bentoml containerize`:
$ bentoml containerize sentence-embedding-svc:scyvqxrxlc4rduqj [or bentoml build --containerize]
* Push to BentoCloud with `bentoml push`:
$ bentoml push sentence-embedding-svc:scyvqxrxlc4rduqj [or bentoml build --push]
You can also try the simplified build script, which is configured through the GPU and HF_MODEL environment variables:
GPU=true HF_MODEL=BAAI/bge-small-zh-v1.5 bash simple_build.sh
BentoML provides a number of deployment options. The easiest way to set up a production-ready endpoint for your text embedding service is via BentoCloud, the serverless cloud platform built for BentoML by the BentoML team.
Next steps:
- Sign up for a BentoCloud account here.
- Get an API Token, see instructions here.
- Push your Bento to BentoCloud:
bentoml push sentence-embedding-svc:latest
- Deploy via Web UI, see Deploying on BentoCloud
Looking to use a different embedding model? Check out the MTEB Leaderboard and decide which embedding model works best for your use case. Then modify the code in the import_model.py, embedding_runnable.py, and service.py files to swap in the new model.
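For example, switching to BAAI/bge-small-zh-v1.5 is mostly a matter of changing the model identifier used when importing and loading the model (the exact variable name below is illustrative):

# In import_model.py (sketch): point the import at the new model
MODEL_ID = "BAAI/bge-small-zh-v1.5"  # was "sentence-transformers/all-MiniLM-L6-v2"

Keep in mind that different models produce embeddings of different dimensions, so anything downstream that consumes the vectors may need updating.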
See BentoML docs for advanced topics such as
performance optimization, runtime configurations, serving with GPU, and adaptive
batching.
👉 Join our AI Application Developer community!