michaelfeil/infinity

Different results with mixedbread-ai/mxbai-embed-large-v1 model


System Info

The full command line used that causes issues: docker run --rm -p 7997:7997 michaelf34/infinity:latest --model-name-or-path mixedbread-ai/mxbai-embed-large-v1 --port 7997
OS version: macos
Model being used: mixedbread-ai/mxbai-embed-large-v1
Hardware used (GPUs/CPU/Accelerator) (nvidia-smi): CPU (apple silicon)
The current version being used: michaelf34/infinity:latest docker image

Information

  • Docker
  • The CLI directly via pip

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Infinity:

  1. Run the docker image: docker run --rm -p 7997:7997 michaelf34/infinity:latest --model-name-or-path mixedbread-ai/mxbai-embed-large-v1 --port 7997
  2. Generate embeddings:
    curl -X 'POST' \
      'http://localhost:7997/embeddings' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
      "input": [
        "this is a sample sentence"
      ],
      "model": "string",
      "user": "string"
    }'
    
  3. Result: [0.006450683809816837, 0.001854013535194099, 0.02515273354947567, ...]
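For reference, the same request can also be sent from Python. This is a minimal sketch assuming the OpenAI-compatible response layout ({"data": [{"embedding": [...]}]}) that Infinity exposes; adjust the field names if your server version differs.

# Minimal sketch: query the running Infinity container from Python.
# Assumes requests is installed and the docker command above is running.
import requests

resp = requests.post(
    "http://localhost:7997/embeddings",
    json={"input": ["this is a sample sentence"], "model": "mixedbread-ai/mxbai-embed-large-v1"},
    timeout=30,
)
resp.raise_for_status()
embedding = resp.json()["data"][0]["embedding"]
print(embedding[:3])  # first few values of the returned embedding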

Local Sentence Transformer comparison:

  1. Install sentence transformers: pip install -U sentence-transformers
  2. Run the sentence transformer embedding generation
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")  
    model.encode("this is a sample sentence")
    
  3. Result: [ 0.10402928, 0.02989876, 0.40563244, ..., -0.02523705, -0.0800657 , -0.22046797]

Expected behavior

mixedbread-ai/mxbai-embed-large-v1 is one of the best open-source models on MTEB right now. It would be great if the Infinity embeddings matched the local sentence transformer embeddings exactly.

I checked by changing the model to sentence-transformers/all-mpnet-base-v2, and Infinity matches the local sentence transformer exactly!

@stephenleo There are two things to check here.

  1. Infinity normalizes your embeddings, for a good reason: the magnitude of the embeddings is mostly irrelevant, and keeping it will likely lead to unwanted drops in retrieval scores. [ 0.10402928, 0.02989876, 0.40563244, ...] does not look like it will end up having |x| = 1. In sentence-transformers, this behavior corresponds to normalize_embeddings=True.
  2. There are some trade-offs, e.g. the embeddings are calculated in fp16. Instead of asserting identical values, please use the dot product of the normalized vectors, emb_a @ emb_b, to check how much they deviate; both vectors need to be normalized (see the sketch below).
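A minimal sketch of that check, assuming the Infinity container from the reproduction steps is still running and returns an OpenAI-compatible payload (the response field names are assumptions, not verified against every Infinity version):

# Sketch: compare the Infinity embedding with a locally computed, normalized one.
import numpy as np
import requests
from sentence_transformers import SentenceTransformer

resp = requests.post(
    "http://localhost:7997/embeddings",
    json={"input": ["this is a sample sentence"], "model": "mixedbread-ai/mxbai-embed-large-v1"},
)
emb_a = np.asarray(resp.json()["data"][0]["embedding"])  # Infinity output (already normalized)

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
emb_b = model.encode("this is a sample sentence", normalize_embeddings=True)  # local, normalized

print(emb_a @ emb_b)  # close to 1.0 if the only differences come from fp16 rounding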

Let me know what your output is.
I actually ran a decent number of tests to verify that sentence-transformers behaves the same. If that's not the case, feel free to ping again.

Ah yes, you are right. Normalizing is a good idea. Updating the sentence-transformers code to the below matches the sentence-transformers prediction to Infinity. Minor mismatches in the 10th decimal place can be ignored, I think.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")  
model.encode("this is a sample sentence", normalize_embeddings=True)
[ 0.00645068,  0.00185397,  0.02515271, ..., -0.00156493, -0.00496476, -0.01367089]
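For completeness, a quick sanity check (with numpy installed) that the normalized local embedding is unit-length, so the remaining differences are just floating-point rounding:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
emb = model.encode("this is a sample sentence", normalize_embeddings=True)
print(np.linalg.norm(emb))  # ~1.0, matching Infinity's normalized output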

Awesome!