towhee-io/towhee

[Bug]: Fix in #2617 caused embedding result change

zhfkt opened this issue · 7 comments

zhfkt commented

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Fix in #2617 caused embedding result change. I tried and it seems this commit will cause the embedding result change in model vit_base_patch16_224_in21k.

Expected Behavior

Embedding result change should not be modified.

I believe this commit is a little dangerous and will break some online services. I guess the only way is to re-index all images in milvus.

Steps To Reproduce

As mentioned in #2617 (comment)

conf = AutoConfig.load_config('image-embedding')
conf.model_name = 'vit_base_patch16_224_in21k'
image_embedding_pipeline = AutoPipes.pipeline('image-embedding', conf)


def embedding_ops_func(img):
    return image_embedding_pipeline(img).get()

# preload
def preload_model():
    return embedding_ops_func(sample_jpg_path)

print(preload_model())

Use any image for sample_jpg_path and you will see the difference. After I revert self.model.model to self.model, it will keep the original embedding result.

image

Environment

- Towhee version(e.g. v0.1.3 or 8b23a93): 1.1.0 in Python 3.11.4
- OS(Ubuntu or CentOS): Debian GNU/Linux 12 (bookworm)
- CPU/Memory: 2cpu + 8gb
- GPU: no 
- Others:

Anything else?

No response

Yes, because after fixing it, I was able to get the correct preprocessing parameters. Before fixing, it was using the default parameters.
image

image

If you don't want to be affected by this change, you can specify using the old version of the op.

zhfkt commented

I believe lots of users will suffer from this fix since it breaks the compatibility. Especially for the users who are packaging service in Docker in ci/cd, because sometimes they will install towhee model from scratch.

At least, you need to mention this change in the release note such as - !!!Caution: Break compatibility!!!.

This has almost no effect on the outcome. You can test it with the following code, the calculated cosine distances are mostly 1.

from towhee import ops, pipe, DataLoader
import numpy as np

p1 = (
    pipe.input('image_path')
    .map('image_path', 'image', ops.image_decode.cv2('rgb'))
    .map('image', 'vec', ops.image_embedding.timm(model_name='vit_base_patch16_224_in21k').revision('db6727933a'))
    .output('vec')
)

p2 = (
    pipe.input('image_path')
    .map('image_path', 'image', ops.image_decode.cv2('rgb'))
    .map('image', 'vec', ops.image_embedding.timm(model_name='vit_base_patch16_224_in21k'))
    .output('vec')
)

for image in DataLoader(ops.data_source.glob(YOUR_IMAGE_PATH)):
    print(image)
    n1 = p1(image).get()[0]
    n2 = p2(image).get()[0]
    cos = np.dot(n1, n2) / (np.linalg.norm(n1) * np.linalg.norm(n2) )
    print(cos)
zhfkt commented

Thanks for providing the hint cosine distance. I saw you were using ops.image_embedding.timm and I was using AutoPipes.pipeline. Will I need to normalize vector by np.linalg.norm myself for AutoPipes.pipeline as well ? Will AutoPipes.pipeline help me to automatically normalize the embedding ? I was not using np.linalg.norm currently for the return of AutoPipes.pipeline.

The pipeline returns vectors generated by the model without normalization.
An example of normalizing.

from towhee import ops, pipe, DataLoader, AutoConfig, AutoPipes


conf = AutoConfig.load_config('image-embedding')
conf.model_name = 'vit_base_patch16_224_in21k'
image_embedding_pipeline = AutoPipes.pipeline('image-embedding', conf)


embedding_with_noramlize = (
    pipe.input('image_path')
    .map('image_path', 'vec', image_embedding_pipeline)
    .map('vec', 'vec', ops.towhee.np_normalize())
    .output('vec')
)
stale commented

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Close the stale issues and pull requests after 7 days of inactivity. Reopen the issue with /reopen.