glowrs

A candle-rs sentence embedder library & server

Library Usage

glowrs provides an easy and familiar interface for using pre-trained models for embeddings and sentence similarity. It is inspired by sentence-transformers, a popular Python library for sentence embeddings that features a wide range of models and utilities.

Example

use glowrs::{SentenceTransformer, Device, PoolingStrategy, Error};

fn main() -> Result<(), Error> {
    // Build the encoder, fetching the model from the Hugging Face Hub.
    let encoder = SentenceTransformer::builder()
        .with_model_repo("sentence-transformers/all-MiniLM-L6-v2")?
        .with_device(Device::Cpu)
        .build()?;

    let sentences = vec![
        "Hello, how are you?",
        "Hey, how are you doing?"
    ];

    // Encode the batch of sentences into embedding vectors.
    let embeddings = encoder.encode_batch(sentences, true)?;

    println!("{:?}", embeddings);
    
    Ok(())
}
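
Since the library targets sentence similarity as well as raw embeddings, a common next step is to compare the returned vectors. Below is a minimal, hypothetical sketch of cosine similarity over two embedding rows, assuming they have been converted to plain f32 slices (the exact return type of encode_batch may differ; check the crate docs). If the embeddings are already L2-normalized, the dot product alone gives the same ranking.

// Cosine similarity between two equal-length embedding vectors.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}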

Features

  • Load models from Hugging Face Hub
  • Use hardware acceleration (Metal, CUDA)
  • More to come!

Server Usage

glowrs-server provides a web server for sentence embedding inference, using candle as the tensor framework. It currently supports BERT-type models hosted on the Hugging Face Hub, such as those provided by sentence-transformers, Tom Aarsen, or Jina AI, provided they ship safetensors model weights.

Example usage with the jina-embeddings-v2-base-en model:

cargo run --bin glowrs-server --release -- --model-repo jinaai/jina-embeddings-v2-base-en

If you want to use a specific revision of the model, append it to the repository name like so:

cargo run --bin glowrs-server --release -- --model-repo jinaai/jina-embeddings-v2-base-en:main

If you want to serve multiple models, you can pass multiple model repositories to a single glowrs-server instance.

cargo run --bin glowrs-server --release -- --model-repo jinaai/jina-embeddings-v2-base-en sentence-transformers/paraphrase-multilingual-mpnet-base-v2

Warning: serving multiple models is currently not supported with Metal acceleration.

Instructions:

Usage: glowrs-server [OPTIONS]

Options:
  -m, --model-repo <MODEL_REPO>
  -r, --revision <REVISION>      [default: main]
  -h, --help                     Print help

Build features

  • metal: Compile with Metal acceleration
  • cuda: Compile with CUDA acceleration
  • accelerate: Compile with Accelerate framework acceleration (CPU)
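
For example, to compile the server with Metal acceleration (macOS only; the cuda and accelerate features work analogously, assuming the features are defined on the glowrs-server package):

cargo run --bin glowrs-server --release --features metal -- --model-repo jinaai/jina-embeddings-v2-base-en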

Docker Usage

For now, the Docker image only supports CPU inference on x86_64 and arm64.

docker run -p 3000:3000 ghcr.io/wdoppenberg/glowrs-server:latest --model-repo <MODEL_REPO>

Features

  • OpenAI API compatible (/v1/embeddings) REST API endpoint
  • candle inference for bert and jina-bert models
  • Hardware acceleration (Metal for now)
  • Queueing
  • Multiple models
  • Batching
  • Performance metrics

curl

curl -X POST http://localhost:3000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "input": ["The food was delicious and the waiter...", "was too"],
    "model": "sentence-transformers/all-MiniLM-L6-v2",
    "encoding_format": "float"
  }'

Python openai client

Install the OpenAI Python library:

pip install openai

Use the embeddings endpoint as you normally would:

from openai import OpenAI
from time import time

client = OpenAI(
	api_key="sk-something",
	base_url="http://127.0.0.1:3000/v1"
)

start = time()
print(client.embeddings.create(
	input=["This is a sentence that requires an embedding"] * 50,
	model="jinaai/jina-embeddings-v2-base-en"
))

print(f"Done in {time() - start:.2f} s")

# List models
print(client.models.list())
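
Rust reqwest client

The endpoint can also be called from Rust. Below is a minimal, hypothetical sketch using the reqwest (with its blocking and json features enabled) and serde_json crates, which are not dependencies of glowrs; the field names simply follow the OpenAI-style schema shown in the curl example:

use serde_json::json;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = reqwest::blocking::Client::new();

    // Same payload as the curl example above.
    let body = json!({
        "input": ["The food was delicious and the waiter...", "was too"],
        "model": "sentence-transformers/all-MiniLM-L6-v2",
        "encoding_format": "float"
    });

    let response: serde_json::Value = client
        .post("http://localhost:3000/v1/embeddings")
        .json(&body)
        .send()?
        .json()?;

    // Each element of `data` contains an `embedding` array.
    println!("{}", response["data"][0]["embedding"]);

    Ok(())
}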

Details

  • Use TOKIO_WORKER_THREADS to set the number of threads per queue.
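
For example, to run the server with 4 worker threads per queue:

TOKIO_WORKER_THREADS=4 cargo run --bin glowrs-server --release -- --model-repo jinaai/jina-embeddings-v2-base-en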

Disclaimer

This is still a work in progress. Embedding performance is decent but could benefit from proper benchmarking. Furthermore, this is meant to be a lightweight embedding model library and server.

Do not use this in a production environment. If you are looking for something production-ready & in Rust, consider text-embeddings-inference.

Credits