huggingface/text-embeddings-inference

Support Alibaba-NLP/gte-large-en-v1.5 on CPU/MPS

tmostak opened this issue · 0 comments

Feature request

We'd like to run the Alibaba-NLP/gte-large-en-v1.5 model on a CPU text-embeddings-router server, but are hitting the following error:

```
Caused by:
    Could not start backend: GTE is only supported on Cuda devices in fp16 with flash attention enabled
```
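For reference, a minimal sketch of the invocation that triggers this, using the public CPU Docker image (the image tag here is illustrative; adjust for your TEI version):

```shell
docker run -p 8080:80 -v $PWD/data:/data \
  ghcr.io/huggingface/text-embeddings-inference:cpu-1.2 \
  --model-id Alibaba-NLP/gte-large-en-v1.5
```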

Is there any way to implement or enable CPU support for this model?
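For what it's worth, the model itself appears to run fine on CPU through plain transformers (a rough sketch; this model ships custom modeling code on the Hub, hence trust_remote_code=True), so the limitation seems to be in the backend rather than the model:

```python
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "Alibaba-NLP/gte-large-en-v1.5"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Custom modeling code on the Hub requires trust_remote_code.
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer(
    ["example sentence to embed"],
    padding=True, truncation=True, max_length=8192, return_tensors="pt",
)
with torch.no_grad():
    outputs = model(**inputs)

# CLS pooling (first token), as in the model card, then L2 normalization.
embeddings = outputs.last_hidden_state[:, 0]
embeddings = torch.nn.functional.normalize(embeddings, p=2, dim=1)
print(embeddings.shape)  # (1, 1024) for the large variant
```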

Motivation

For some of our clients we need to support a CPU embedding server, and we would like to use the Alibaba-NLP/gte-large-en-v1.5 model to take advantage of its long 8192-token context length.

Your contribution

We'd be happy to test and run performance benchmarks if needed.