Support Alibaba-NLP/gte-large-en-v1.5 on CPU/MPS
tmostak opened this issue · 0 comments
tmostak commented
Feature request
We'd like to run the Alibaba-NLP/gte-large-en-v1.5 model on a CPU-only text-embeddings-router server, but are hitting:
Caused by:
Could not start backend: GTE is only supported on Cuda devices in fp16 with flash attention enabled
Is there any way to implement or enable CPU support for this model?
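For context, the model itself does appear to run on CPU through plain transformers (its trust_remote_code modeling path has an eager-attention fallback that doesn't require flash attention). Below is a minimal sketch of that workaround, following the CLS-pooling usage from the model card; it is an illustration of what we'd like the router to do natively, not a description of the router's internals:

```python
# Sketch: running Alibaba-NLP/gte-large-en-v1.5 on CPU via plain
# transformers, outside of text-embeddings-router. Assumes the
# model's custom modeling code (pulled in via trust_remote_code)
# works without flash attention, as it does on its default path.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_id = "Alibaba-NLP/gte-large-en-v1.5"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
model.eval()

texts = ["what is the capital of China?"]
batch = tokenizer(
    texts,
    max_length=8192,  # the long context length we want to use
    padding=True,
    truncation=True,
    return_tensors="pt",
)

with torch.no_grad():
    outputs = model(**batch)

# CLS pooling + L2 normalization, as described on the model card.
embeddings = F.normalize(outputs.last_hidden_state[:, 0], p=2, dim=1)
print(embeddings.shape)  # torch.Size([1, 1024])
```

This works, but it gives up the batching, tokenization pipeline, and serving layer that text-embeddings-router provides, which is why native CPU support would be valuable.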
Motivation
For some of our clients we need to support a CPU-only embedding server, and we would like to use the Alibaba-NLP/gte-large-en-v1.5 model to take advantage of its long 8192-token context length.
Your contribution
We'd be happy to test and run performance benchmarks if needed.