michaelfeil/infinity

Different results with mixedbread-ai/mxbai-embed-large-v1 model


System Info

The full command line used that causes issues: docker run --rm -p 7997:7997 michaelf34/infinity:latest --model-name-or-path mixedbread-ai/mxbai-embed-large-v1 --port 7997
OS version: macos
Model being used: mixedbread-ai/mxbai-embed-large-v1
Hardware used (GPUs/CPU/Accelerator) (nvidia-smi): CPU (apple silicon)
The current version being used: michaelf34/infinity:latest docker image

Information

  • Docker
  • The CLI directly via pip

Tasks

  • An officially supported command
  • My own modifications

Reproduction

Infinity:

  1. Run the docker image: docker run --rm -p 7997:7997 michaelf34/infinity:latest --model-name-or-path mixedbread-ai/mxbai-embed-large-v1 --port 7997
  2. Generate embeddings:
    curl -X 'POST' \
      'http://localhost:7997/embeddings' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
      "input": [
        "this is a sample sentence"
      ],
      "model": "string",
      "user": "string"
    }'
    
  3. Result: [0.006450683809816837, 0.001854013535194099, 0.02515273354947567, ...]
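For reference, the same request can also be sent from Python. This is a minimal sketch assuming the OpenAI-compatible response layout ({"data": [{"embedding": [...]}]}) that Infinity exposes; adjust the field names if your server version differs.

# Minimal sketch: query the running Infinity container from Python.
# Assumes requests is installed and the docker command above is running.
import requests

resp = requests.post(
    "http://localhost:7997/embeddings",
    json={"input": ["this is a sample sentence"], "model": "mixedbread-ai/mxbai-embed-large-v1"},
    timeout=30,
)
resp.raise_for_status()
embedding = resp.json()["data"][0]["embedding"]
print(embedding[:3])  # first few values of the returned embedding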

Local Sentence Transformer comparison:

  1. Install sentence transformers: pip install -U sentence-transformers
  2. Run the sentence transformer embedding generation
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")  
    model.encode("this is a sample sentence")
    
  3. Result: [ 0.10402928, 0.02989876, 0.40563244, ..., -0.02523705, -0.0800657 , -0.22046797]

Expected behavior

mixedbread-ai/mxbai-embed-large-v1 is one of the best open-source models on MTEB right now. It would be great if the Infinity embeddings matched the local sentence transformer embeddings exactly.

I checked by changing the model to sentence-transformers/all-mpnet-base-v2, and Infinity matches the local sentence transformer exactly!

@stephenleo There are two things to check here.

  1. Infinity normalizes your embeddings, for a good reason: the magnitude of the embeddings is mostly irrelevant, and keeping it will likely lead to unwanted drops in retrieval scores. [ 0.10402928, 0.02989876, 0.40563244, ...] does not look like it will end up having |x| = 1. In sentence-transformers, this behavior corresponds to normalize_embeddings=True.
  2. There are some trade-offs, e.g. the embeddings are calculated in fp16. Instead of asserting identical values, please use the dot product of the normalized vectors, emb_a @ emb_b, to check how much they deviate; both vectors need to be normalized (see the sketch below).
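A minimal sketch of that check, assuming the Infinity container from the reproduction steps is still running and returns an OpenAI-compatible payload (the response field names are assumptions, not verified against every Infinity version):

# Sketch: compare the Infinity embedding with a locally computed, normalized one.
import numpy as np
import requests
from sentence_transformers import SentenceTransformer

resp = requests.post(
    "http://localhost:7997/embeddings",
    json={"input": ["this is a sample sentence"], "model": "mixedbread-ai/mxbai-embed-large-v1"},
)
emb_a = np.asarray(resp.json()["data"][0]["embedding"])  # Infinity output (already normalized)

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
emb_b = model.encode("this is a sample sentence", normalize_embeddings=True)  # local, normalized

print(emb_a @ emb_b)  # close to 1.0 if the only differences come from fp16 rounding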

Let me know what your output is.
I actually ran a decent number of tests to verify that sentence-transformers behaves the same. If that's not the case, feel free to ping again.

Ah yes, you are right. Normalizing is a good idea. Updating the sentence-transformers code to the below matches the sentence-transformers prediction to Infinity. Minor mismatches in the 10th decimal place can be ignored, I think.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")  
model.encode("this is a sample sentence", normalize_embeddings=True)
[ 0.00645068,  0.00185397,  0.02515271, ..., -0.00156493, -0.00496476, -0.01367089]
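For completeness, a quick sanity check (with numpy installed) that the normalized local embedding is unit-length, so the remaining differences are just floating-point rounding:

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")
emb = model.encode("this is a sample sentence", normalize_embeddings=True)
print(np.linalg.norm(emb))  # ~1.0, matching Infinity's normalized output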

Awesome!