Different results with mixedbread-ai/mxbai-embed-large-v1 model
Closed this issue · 3 comments
System Info
The full command line used that causes issues: docker run --rm -p 7997:7997 michaelf34/infinity:latest --model-name-or-path mixedbread-ai/mxbai-embed-large-v1 --port 7997
OS version: macOS
Model being used: mixedbread-ai/mxbai-embed-large-v1
Hardware used (GPUs/CPU/Accelerator) (nvidia-smi): CPU (apple silicon)
The current version being used: michaelf34/infinity:latest
docker image
Information
- Docker
- The CLI directly via pip
Tasks
- An officially supported command
- My own modifications
Reproduction
Infinity:
- Run the docker image:
docker run --rm -p 7997:7997 michaelf34/infinity:latest --model-name-or-path mixedbread-ai/mxbai-embed-large-v1 --port 7997
- Generate embeddings:
curl -X 'POST' 'http://localhost:7997/embeddings' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{"input": ["this is a sample sentence"], "model": "string", "user": "string"}'
- Result:
[0.006450683809816837, 0.001854013535194099, 0.02515273354947567, ...]
Local Sentence Transformer comparison:
- Install sentence transformers:
pip install -U sentence-transformers
- Run the sentence transformer embedding generation
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
model.encode("this is a sample sentence")
- Result:
[ 0.10402928, 0.02989876, 0.40563244, ..., -0.02523705, -0.0800657 , -0.22046797]
Expected behavior
mixedbread-ai/mxbai-embed-large-v1
is one of the best open-source models on MTEB right now. It would be great if the Infinity embeddings matched the local sentence-transformers embeddings exactly.
I checked by changing the model to sentence-transformers/all-mpnet-base-v2, and Infinity matches the local sentence transformer exactly!
@stephenleo There are two things to check here.
- Infinity normalizes your encodings, for a good reason: the magnitude of the embeddings is mostly irrelevant, and keeping it would likely lead to unwanted drops in retrieval scores. Your output [0.10402928, 0.02989876, 0.40563244, ...] does not look like it ends up with |x| = 1. This behavior corresponds to normalize_embeddings=True in sentence-transformers.
- There are some trade-offs, e.g. calculating the embeddings in fp16. Instead of asserting identical values, please compute the dot product of the normalized vectors, emb_a @ emb_b, to check for a deviating distance. Both vectors need to be normalized.
Let me know what your output is.
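The suggested check can be sketched as follows. The short vectors below are stand-ins truncated from the outputs quoted in this thread (the real embeddings are much longer); emb_a plays the role of the Infinity output and emb_b the raw sentence-transformers output:

```python
import numpy as np

# Truncated stand-ins for the two outputs quoted above.
emb_a = np.array([0.006450684, 0.001854014, 0.025152734])  # Infinity
emb_b = np.array([0.10402928, 0.02989876, 0.40563244])     # sentence-transformers

# Normalize both to unit L2 norm so the dot product equals cosine similarity.
emb_a = emb_a / np.linalg.norm(emb_a)
emb_b = emb_b / np.linalg.norm(emb_b)

similarity = emb_a @ emb_b
print(similarity)  # close to 1.0 when the embeddings agree up to scale
```

If the similarity is essentially 1.0, the two outputs differ only by normalization (and perhaps fp16 rounding), not in direction.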
I actually ran a decent number of tests to verify that sentence-transformers behaves the same. If that's not the case, feel free to ping again.
Ah yes, you are right. Normalizing is a good idea. Updating the sentence-transformers code to the below matches the sentence-transformers prediction to Infinity. Minor mismatches in the 10th decimal place can be ignored, I think.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
model.encode("this is a sample sentence", normalize_embeddings=True)
[ 0.00645068, 0.00185397, 0.02515271, ..., -0.00156493, -0.00496476, -0.01367089]
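For reference, normalize_embeddings=True is equivalent to dividing the raw embedding by its L2 norm; a minimal numpy sketch (the helper name is mine, not part of either library):

```python
import numpy as np

def l2_normalize(vec: np.ndarray) -> np.ndarray:
    """Scale a vector to unit L2 norm, mirroring normalize_embeddings=True."""
    return vec / np.linalg.norm(vec)

# Applied to the full raw embedding, this reproduces the normalized
# values that Infinity returns; the resulting norm is ~1.0.
unit = l2_normalize(np.array([3.0, 4.0]))
print(np.linalg.norm(unit))  # ~1.0
```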
Awesome!