Proxy server, written in Rust, for a Triton gRPC server that runs an embedding model.
- it refines the request and response formats of the Triton server.
- without `tritonclient` dependency.
- fast & easy to use.
`BAAI/bge-m3` is used as an example.

- It'll convert the PyTorch model into an ONNX model with the CLS pooling + L2 normalization layers, and save it to `./model_repository/embedding/1/model.onnx`.
- if you don't want to add the pooling + L2 normalization layers, you need to change the `config.pbtxt` accordingly.
- Currently, `max_batch_size` is limited to 256 due to OOM. You can change this value to fit your environment.
python3 convert.py
- It'll run both the Triton inference server and the proxy server.
- You need to edit the absolute path of the volume (which points to `./model_repository`) in `docker-compose.yml`.
make run-docker-compose
- You can also build and run the Triton proxy server with the commands below.
export RUSTFLAGS="-C target-cpu=native"
make server
make build-docker
docker run --gpus all --rm --ipc=host --shm-size=8g --ulimit memlock=-1 --ulimit stack=67108864 -p 8000:8000 -p 8001:8001 -p 8002:8002 -v $(pwd)/triton-grpc-proxy-rs/model_repository:/models nvcr.io/nvidia/tritonserver:24.07-py3 bash -c "LD_PRELOAD=/usr/lib/$(uname -m)-linux-gnu/libtcmalloc.so.4:${LD_PRELOAD} && pip install transformers tokenizers && tritonserver --model-repository=/models"
- receive request(s) from the user.
  - a list of `text (String)` in this case.
- request the Triton gRPC server to get embeddings.
- post-process (cast and reshape) the embeddings and return them to the users (see the sketch below this list).
- Swagger : `http://localhost:8080/explorer/`
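As a rough illustration of the cast & reshape step, the proxy essentially turns Triton's raw little-endian FP32 output buffer into `batch_size` vectors of `EMBEDDING_SIZE` floats. The sketch below is illustrative only; the function name and shape handling are assumptions, not the repository's actual code.

```rust
/// Illustrative only: cast a raw little-endian FP32 buffer (e.g. Triton's
/// `raw_output_contents`) into `batch_size` vectors of `embedding_size` floats.
fn cast_and_reshape(raw: &[u8], embedding_size: usize) -> Vec<Vec<f32>> {
    let floats: Vec<f32> = raw
        .chunks_exact(4)
        .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
        .collect();

    // reshape the flat vector into rows of `embedding_size`
    floats
        .chunks(embedding_size)
        .map(|row| row.to_vec())
        .collect()
}

fn main() {
    // two fake 4-dim embeddings packed into one raw buffer
    let raw: Vec<u8> = [1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
        .iter()
        .flat_map(|f| f.to_le_bytes())
        .collect();
    assert_eq!(cast_and_reshape(&raw, 4).len(), 2);
}
```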
The proxy server parses its configuration from the following environment variables:
- `SERVER_PORT` : proxy server port. default `8080`.
- `TRITON_SERVER_URL` : triton inference gRPC server URL. default `http://triton-server`.
- `TRITON_SERVER_GRPC_PORT` : triton inference gRPC server port. default `8001`.
- `MODEL_VERSION` : model version. default `1`.
- `MODEL_NAME` : model name. default `model`.
- `INPUT_NAME` : input name. default `text`.
- `OUTPUT_NAME` : output name. default `embedding`.
- `EMBEDDING_SIZE` : size of the embedding. default `1024`.
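For reference, here is a minimal sketch of reading these variables with their defaults using only `std::env`; the struct and field names are illustrative, not the proxy server's actual config code.

```rust
use std::env;

/// Read an environment variable, falling back to a default value.
fn env_or(key: &str, default: &str) -> String {
    env::var(key).unwrap_or_else(|_| default.to_string())
}

#[derive(Debug)]
struct Config {
    server_port: u16,
    triton_server_url: String,
    triton_server_grpc_port: u16,
    model_version: String,
    model_name: String,
    input_name: String,
    output_name: String,
    embedding_size: usize,
}

fn load_config() -> Config {
    Config {
        server_port: env_or("SERVER_PORT", "8080").parse().expect("invalid SERVER_PORT"),
        triton_server_url: env_or("TRITON_SERVER_URL", "http://triton-server"),
        triton_server_grpc_port: env_or("TRITON_SERVER_GRPC_PORT", "8001").parse().expect("invalid TRITON_SERVER_GRPC_PORT"),
        model_version: env_or("MODEL_VERSION", "1"),
        model_name: env_or("MODEL_NAME", "model"),
        input_name: env_or("INPUT_NAME", "text"),
        output_name: env_or("OUTPUT_NAME", "embedding"),
        embedding_size: env_or("EMBEDDING_SIZE", "1024").parse().expect("invalid EMBEDDING_SIZE"),
    }
}

fn main() {
    println!("{:?}", load_config());
}
```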
- GET `/health`
curl -i http://127.0.0.1:8080/health
HTTP/1.1 200 OK
content-length: 2
date: Sun, 08 Oct 2023 06:33:53 GMT
ok
- GET `/metrics` : get prometheus metrics
curl -i http://127.0.0.1:8080/metrics
- POST `/v1/embedding`
- Request Body : `[{'query': 'input'}, ... ]`
curl -H "Content-type:application/json" -X POST http://127.0.0.1:8080/v1/embedding -d "[{\"query\": \"asdf\"}, {\"query\": \"asdf asdf\"}, {\"query\": \"asdf asdf asdf\"}, {\"query\": \"asdf asdf asdf asdf\"}]"
- Response Body : `[{'embedding': '1024 f32 vector'}, ...]`
[{"embedding": [-0.8067292,-0.004603,-0.24123234,0.59398544,-0.5583446,...]}, ...]
- Environment
  - CPU : i7-7700K (not overclocked)
  - GPU : GTX 1060 6 GB
  - Rust : v1.79.0 stable
  - Triton Server : `24.07-py3`
    - backend : onnxruntime-gpu
    - allocator : tcmalloc
  - model : `BAAI/bge-m3` w/ fp32
  - payload : `[{'query': 'asdf' * 126}] * batch_size` (`'asdf' * 126` == 255 tokens)
- stages
  - model : only triton gRPC server latency (preprocess + tokenize + model)
  - processing : end-to-end latency (service-side)
    - json de/serialization
    - payload serialization (byte string, float vector)
    - cast & reshape 2d vectors
| batch size | model (p90) | processing (p90) |
|---|---|---|
| 8 | 1428.20 ms | 0.044 ms |
| 16 | 2915.01 ms | 0.051 ms |
| 32 | 5626.15 ms | 0.055 ms |
- optimize the processing performance and memory usage
- support `/metrics` endpoint to get prometheus metrics
- OpenAPI specifications and Swagger UI
- add `Dockerfile` and `docker-compose` to easily deploy the servers
- triton inference server
  - add model converter script.
- configurations
  - move hard-coded configs to `env`
- optimize the proxy server performance
- README
- move `tokenizer` part from triton server into `proxy-server`