I created this repo to support my own work deploying Hugging Face models to Triton Inference Server. In my homelab, I use it to create embeddings for my Qdrant embedding index.
Install the `optimum-cli` tool from the Optimum library and export the model to ONNX:

```bash
pip install optimum[exporters]
optimum-cli export onnx --model intfloat/multilingual-e5-large-instruct multilingual-e5-large-instruct
```
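Optionally, you can sanity-check the exported model locally before wiring it into Triton. The sketch below assumes `onnxruntime` and `transformers` are installed and that the export produced the usual `input_ids`/`attention_mask` inputs; inspect the session's inputs and outputs if your export differs.

```python
# Sketch: sanity-check the exported ONNX model before deploying it to Triton.
# Input names are assumptions; check session.get_inputs()/get_outputs() if your
# export differs.
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# The export directory also contains the tokenizer files.
tokenizer = AutoTokenizer.from_pretrained("multilingual-e5-large-instruct")
session = ort.InferenceSession("multilingual-e5-large-instruct/model.onnx")

encoded = tokenizer(["Hello, world!"], return_tensors="np")
outputs = session.run(
    None,
    {
        "input_ids": encoded["input_ids"].astype(np.int64),
        "attention_mask": encoded["attention_mask"].astype(np.int64),
    },
)
print(outputs[0].shape)  # token embeddings, e.g. (1, seq_len, 1024)
```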
Then place the tokenizer in a model directory that uses the Python backend and the ONNX model in a directory that uses the ONNX backend, like this:
```
model_repository
├── ONNX-embedding-multilingual-e5-large-instruct
│   ├── 1
│   │   ├── config.json
│   │   ├── model.onnx
│   │   └── model.onnx_data
│   └── config.pbtxt
└── TOKENIZER-multilingual-e5-large-instruct
    ├── 1
    │   ├── model.py
    │   └── tokenizer
    │       ├── sentencepiece.bpe.model
    │       ├── special_tokens_map.json
    │       ├── tokenizer.json
    │       └── tokenizer_config.json
    └── config.pbtxt
```
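For reference, the tokenizer model's `model.py` implements Triton's Python backend interface. The following is a minimal, illustrative sketch only; the tensor names (`TEXT`, `input_ids`, `attention_mask`) are assumptions and must match whatever your `config.pbtxt` declares.

```python
# Minimal sketch of a Triton Python backend tokenizer model.
# Tensor names ("TEXT", "input_ids", "attention_mask") are assumptions;
# they must match the names declared in config.pbtxt.
import os

import numpy as np
import triton_python_backend_utils as pb_utils
from transformers import AutoTokenizer


class TritonPythonModel:
    def initialize(self, args):
        # Load the tokenizer files shipped next to this model.py.
        tokenizer_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)), "tokenizer")
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir)

    def execute(self, requests):
        responses = []
        for request in requests:
            # Incoming text arrives as a BYTES tensor.
            texts = pb_utils.get_input_tensor_by_name(request, "TEXT").as_numpy()
            texts = [t.decode("utf-8") for t in texts.flatten()]

            encoded = self.tokenizer(
                texts, padding=True, truncation=True, return_tensors="np"
            )

            responses.append(
                pb_utils.InferenceResponse(
                    output_tensors=[
                        pb_utils.Tensor("input_ids", encoded["input_ids"].astype(np.int64)),
                        pb_utils.Tensor("attention_mask", encoded["attention_mask"].astype(np.int64)),
                    ]
                )
            )
        return responses
```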
You can use the devcontainer in VS Code to develop and test the models. Open the command palette and select `Dev Containers: Reopen in Container`.
Then run:

```bash
make run
```
You can query the models from inside the dev container, or use port-forwarding to query them from your local machine:

```bash
make query-tokenizer
make query-onnx
```
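If you prefer to query the server programmatically, a sketch with the `tritonclient` HTTP client might look like this. The model name, tensor names (`TEXT`, `input_ids`, `attention_mask`), and the default HTTP port 8000 are assumptions and must match your deployment.

```python
# Sketch: query the tokenizer model over HTTP with tritonclient.
# Model name, tensor names, and port are assumptions; adjust to your config.pbtxt.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Text is sent as a BYTES tensor; shape must match the dims in config.pbtxt.
texts = np.array([b"Hello, world!"], dtype=object)
text_input = httpclient.InferInput("TEXT", texts.shape, "BYTES")
text_input.set_data_from_numpy(texts)

result = client.infer(
    model_name="TOKENIZER-multilingual-e5-large-instruct",
    inputs=[text_input],
)

print(result.as_numpy("input_ids"))
print(result.as_numpy("attention_mask"))
```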
Example request data is included in the Makefile.

Check out Triton Inference Server for more information.
I personally use S3 and have Triton pull the model repository from there. All you have to do is add the environment variables `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_DEFAULT_REGION` to the Triton container and point the model repository argument at the S3 bucket:

```bash
tritonserver --model-repository=s3://bucket/path/to/model/repository ...
```
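To get the repository into the bucket in the first place, you can use the AWS CLI or a short script. Below is a sketch with `boto3`; the bucket name and prefix are placeholders, not values from this repo.

```python
# Sketch: upload the local model_repository to S3 so Triton can pull it.
# Bucket name and prefix are placeholders; adjust to your setup.
import os

import boto3

BUCKET = "bucket"
PREFIX = "path/to/model/repository"

s3 = boto3.client("s3")

for root, _, files in os.walk("model_repository"):
    for name in files:
        local_path = os.path.join(root, name)
        # Preserve the directory layout under the prefix.
        key = os.path.join(PREFIX, os.path.relpath(local_path, "model_repository"))
        s3.upload_file(local_path, BUCKET, key)
        print(f"uploaded s3://{BUCKET}/{key}")
```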
- Perf Analyzer guide
- Add ensemble model with average pooling
- More backends
- Optimize inference for the RTX 3090