Text Embeddings Inference

A blazing fast inference solution for text embeddings models.

[Figure: benchmark for BAAI/bge-base-en-v1.5 on an NVIDIA A10 with a sequence length of 512 tokens]

No compilation step
Dynamic shapes
Small Docker images and fast boot times. Get ready for true serverless!
Token based dynamic batching
Optimized transformers code for inference using Flash Attention, Candle and cuBLASLt
Safetensors weight loading
Production ready (distributed tracing with OpenTelemetry, Prometheus metrics)

Get Started

Supported Models

text-embeddings-inference supports any BERT model that uses absolute position embeddings. If the model does not have safetensors weights, you can convert it using this Space.

Support for other model types will be added in the future.

MTEB Rank	Model Type	Model ID	Specific Revision
1	Bert	BAAI/bge-large-en-v1.5	refs/pr/5
2	Bert	BAAI/bge-base-en-v1.5	refs/pr/1
3	Bert	llmrails/ember-v1	
4	Bert	thenlper/gte-large	
5	Bert	thenlper/gte-base	
6	Bert	intfloat/e5-large-v2	
7	Bert	BAAI/bge-small-en-v1.5	refs/pr/3
10	Bert	intfloat/e5-base-v2	

You can explore the list of best performing text embeddings models here.

Docker

model=BAAI/bge-large-en-v1.5
revision=refs/pr/5
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:latest --model-id $model --revision $revision

Once the container is running, you can send requests like this:

curl 127.0.0.1:8080/embed \
    -X POST \
    -d '{"inputs":"What is Deep Learning?"}' \
    -H 'Content-Type: application/json'
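
The same request can be sent from Python. The following is a minimal sketch using only the standard library. The helper names `build_embed_request` and `embed` are hypothetical, and the sketch assumes the server listens on 127.0.0.1:8080 as in the Docker command above and that the `inputs` field accepts either a single string or a list of strings.

```python
import json
import urllib.request


def build_embed_request(inputs, base_url="http://127.0.0.1:8080"):
    """Build a POST request for the /embed route.

    `inputs` is a single string or a list of strings (assumed to be accepted
    by the server in the same "inputs" field as the curl example).
    """
    body = json.dumps({"inputs": inputs}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/embed",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(inputs, base_url="http://127.0.0.1:8080", timeout=30.0):
    """Send the request and return the decoded JSON response."""
    request = build_embed_request(inputs, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the container running, `embed("What is Deep Learning?")` returns the decoded JSON body of the response. Check the /docs route for the exact response schema before relying on its shape.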

Note: To use GPUs, you need to install the NVIDIA Container Toolkit. We also recommend using NVIDIA drivers with CUDA version 12 or higher.

To see all options to serve your models:

text-embeddings-router --help

Usage: text-embeddings-router [OPTIONS]

Options:
      --model-id <MODEL_ID>
          The name of the model to load. Can be a MODEL_ID as listed on <https://hf.co/models> like `thenlper/gte-base`. Or it can be a local directory containing the necessary files as saved by `save_pretrained(...)` methods of transformers

          [env: MODEL_ID=]
          [default: thenlper/gte-base]

      --revision <REVISION>
          The actual revision of the model if you're referring to a model on the hub. You can use a specific commit id or a branch like `refs/pr/2`

          [env: REVISION=]

      --tokenization-workers <TOKENIZATION_WORKERS>
          The number of tokenizer workers used for payload validation and truncation inside the router

          [env: TOKENIZATION_WORKERS=]
          [default: 8]

      --dtype <DTYPE>
          The dtype to be forced upon the model

          [env: DTYPE=]
          [default: float16]
          [possible values: float16, float32]

      --max-concurrent-requests <MAX_CONCURRENT_REQUESTS>
          The maximum amount of concurrent requests for this particular deployment. Having a low limit will refuse clients requests instead of having them wait for too long and is usually good to handle backpressure correctly

          [env: MAX_CONCURRENT_REQUESTS=]
          [default: 512]

      --max-batch-tokens <MAX_BATCH_TOKENS>
          **IMPORTANT** This is one critical control to allow maximum usage of the available hardware.

          This represents the total amount of potential tokens within a batch.

          For `max_batch_tokens=1000`, you could fit `10` queries of `total_tokens=100` or a single query of `1000` tokens.

          Overall this number should be the largest possible until the model is compute bound. Since the actual memory overhead depends on the model implementation, text-embeddings-inference cannot infer this number automatically.

          [env: MAX_BATCH_TOKENS=]
          [default: 16384]

      --max-batch-requests <MAX_BATCH_REQUESTS>
          Optionally control the maximum number of individual requests in a batch

          [env: MAX_BATCH_REQUESTS=]

      --max-client-batch-size <MAX_CLIENT_BATCH_SIZE>
          Control the maximum number of inputs that a client can send in a single request

          [env: MAX_CLIENT_BATCH_SIZE=]
          [default: 32]

      --hf-api-token <HF_API_TOKEN>
          Your HuggingFace hub token

          [env: HF_API_TOKEN=]

      --hostname <HOSTNAME>
          The IP address to listen on

          [env: HOSTNAME=]
          [default: 0.0.0.0]

  -p, --port <PORT>
          The port to listen on

          [env: PORT=]
          [default: 3000]

      --uds-path <UDS_PATH>
          The name of the unix socket some text-embeddings-inference backends will use as they communicate internally with gRPC

          [env: UDS_PATH=]
          [default: /tmp/text-embeddings-inference-server]

      --huggingface-hub-cache <HUGGINGFACE_HUB_CACHE>
          The location of the huggingface hub cache. Used to override the location if you want to provide a mounted disk for instance

          [env: HUGGINGFACE_HUB_CACHE=/data]

      --json-output
          Outputs the logs in JSON format (useful for telemetry)

          [env: JSON_OUTPUT=]

      --otlp-endpoint <OTLP_ENDPOINT>
          [env: OTLP_ENDPOINT=]

      --cors-allow-origin <CORS_ALLOW_ORIGIN>
          [env: CORS_ALLOW_ORIGIN=]
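
To make the `--max-batch-tokens` arithmetic concrete, here is a simplified sketch of token-based batching. It is not the scheduler used by text-embeddings-inference: the function `batch_by_tokens` is hypothetical, it processes requests in arrival order, and it ignores padding. It only illustrates how a token budget, optionally combined with a request cap like `--max-batch-requests`, bounds what fits in one batch.

```python
def batch_by_tokens(token_counts, max_batch_tokens=16384, max_batch_requests=None):
    """Greedily group queued requests into batches under a token budget.

    token_counts: number of tokens in each queued request, in arrival order.
    Returns a list of batches, each a list of indices into token_counts.
    """
    batches, current, current_tokens = [], [], 0
    for index, tokens in enumerate(token_counts):
        if tokens > max_batch_tokens:
            raise ValueError(
                f"request {index} has {tokens} tokens, budget is {max_batch_tokens}"
            )
        over_tokens = current_tokens + tokens > max_batch_tokens
        over_requests = (
            max_batch_requests is not None and len(current) >= max_batch_requests
        )
        # Close the current batch when adding this request would exceed a limit.
        if current and (over_tokens or over_requests):
            batches.append(current)
            current, current_tokens = [], 0
        current.append(index)
        current_tokens += tokens
    if current:
        batches.append(current)
    return batches
```

With a budget of 1000 tokens, ten queries of 100 tokens each land in a single batch, as does one query of 1000 tokens. An eleventh 100-token query starts a second batch.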

Docker Images

Text Embeddings Inference ships with multiple Docker images that you can use to target a specific backend:

Architecture	Image
CPU	ghcr.io/huggingface/text-embeddings-inference:cpu-latest
Turing	ghcr.io/huggingface/text-embeddings-inference:turing-latest
Ampere 80	ghcr.io/huggingface/text-embeddings-inference:latest
Ampere 86	ghcr.io/huggingface/text-embeddings-inference:86-latest
Hopper	ghcr.io/huggingface/text-embeddings-inference:hopper-latest
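
If you deploy to mixed hardware, the choice of image can be scripted. The sketch below is an illustration only: the function `image_for` is hypothetical, the tags are copied from the table above, and the mapping from CUDA compute capability to table row (7.5 for Turing, 8.0 and 8.6 for the two Ampere rows, 9.0 for Hopper) is an assumption about how the rows correspond to GPUs.

```python
IMAGE_REPOSITORY = "ghcr.io/huggingface/text-embeddings-inference"

# Tags come from the table above. The compute capability keys are an
# assumption about which GPUs each row targets.
TAG_BY_COMPUTE_CAPABILITY = {
    (7, 5): "turing-latest",   # Turing
    (8, 0): "latest",          # Ampere 80
    (8, 6): "86-latest",       # Ampere 86
    (9, 0): "hopper-latest",   # Hopper
}


def image_for(compute_capability=None):
    """Return the image name for a (major, minor) compute capability.

    Pass None to select the CPU image.
    """
    if compute_capability is None:
        tag = "cpu-latest"
    else:
        key = tuple(compute_capability)
        if key not in TAG_BY_COMPUTE_CAPABILITY:
            raise ValueError(f"no image listed for compute capability {key}")
        tag = TAG_BY_COMPUTE_CAPABILITY[key]
    return f"{IMAGE_REPOSITORY}:{tag}"
```

For example, `image_for((8, 6))` selects the `86-latest` image and `image_for()` selects the CPU image. Architectures that are not in the table raise an error rather than falling back to a guess.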

API documentation

You can consult the OpenAPI documentation of the text-embeddings-inference REST API using the /docs route. The Swagger UI is also available at: https://huggingface.github.io/text-embeddings-inference.

Using a private or gated model

Set the HUGGING_FACE_HUB_TOKEN environment variable to configure the token that text-embeddings-inference uses to access protected resources.

For example:

1. Go to https://huggingface.co/settings/tokens
2. Copy your CLI READ token
3. Run export HUGGING_FACE_HUB_TOKEN=<your cli READ token>

or with Docker:

model=<your private model>
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
token=<your cli READ token>

docker run --gpus all -e HUGGING_FACE_HUB_TOKEN=$token -p 8080:80 -v $volume:/data --pull always ghcr.io/huggingface/text-embeddings-inference:latest --model-id $model
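
When the launch is scripted, the command can be assembled as an argument list. The sketch below is one possible approach, not part of text-embeddings-inference: the function `docker_run_args` is hypothetical. It uses the `-e NAME` form of `docker run`, which forwards the variable from the calling environment, so the token value never appears in the argument list.

```python
import os


def docker_run_args(
    model_id,
    volume,
    env=os.environ,
    image="ghcr.io/huggingface/text-embeddings-inference:latest",
    host_port=8080,
):
    """Build the `docker run` argument list for a private or gated model."""
    if not env.get("HUGGING_FACE_HUB_TOKEN"):
        raise RuntimeError("HUGGING_FACE_HUB_TOKEN is not set")
    return [
        "docker", "run", "--gpus", "all",
        # Name only: Docker copies the value from the caller's environment.
        "-e", "HUGGING_FACE_HUB_TOKEN",
        "-p", f"{host_port}:80",
        "-v", f"{volume}:/data",
        "--pull", "always",
        image,
        "--model-id", model_id,
    ]
```

The list can be passed to `subprocess.run(args, check=True)` from a shell where the token has been exported, or printed with `shlex.join(args)` for inspection.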

Distributed Tracing

text-embeddings-inference is instrumented with distributed tracing using OpenTelemetry. To enable it, pass the address of an OTLP collector with the --otlp-endpoint argument.

Local install

CPU

You can also opt to install text-embeddings-inference locally.

First install Rust:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Then run:

cargo install --path router -F candle --no-default-features

You can now launch Text Embeddings Inference on CPU with:

model=BAAI/bge-large-en-v1.5
revision=refs/pr/5

text-embeddings-router --model-id $model --revision $revision --port 8080

Note: on some machines, you may also need the OpenSSL libraries and gcc. On Debian-based Linux machines, run:

sudo apt-get install libssl-dev gcc -y

CUDA

Make sure you have CUDA and the NVIDIA drivers installed. We recommend using NVIDIA drivers with CUDA version 12 or higher. You also need to add the NVIDIA binaries to your PATH:

export PATH=$PATH:/usr/local/cuda/bin