[Bug]: Fix in #2617 caused embedding result change
zhfkt opened this issue · 7 comments
Is there an existing issue for this?
- I have searched the existing issues
Current Behavior
Fix in #2617 caused embedding result change. I tried and it seems this commit will cause the embedding result change in model vit_base_patch16_224_in21k
.
Expected Behavior
Embedding result change should not be modified.
I believe this commit is a little dangerous and will break some online services. I guess the only way is to re-index all images in milvus.
Steps To Reproduce
As mentioned in #2617 (comment)
conf = AutoConfig.load_config('image-embedding')
conf.model_name = 'vit_base_patch16_224_in21k'
image_embedding_pipeline = AutoPipes.pipeline('image-embedding', conf)
def embedding_ops_func(img):
return image_embedding_pipeline(img).get()
# preload
def preload_model():
return embedding_ops_func(sample_jpg_path)
print(preload_model())
Use any image for sample_jpg_path
and you will see the difference. After I revert self.model.model
to self.model
, it will keep the original embedding result.
Environment
- Towhee version(e.g. v0.1.3 or 8b23a93): 1.1.0 in Python 3.11.4
- OS(Ubuntu or CentOS): Debian GNU/Linux 12 (bookworm)
- CPU/Memory: 2cpu + 8gb
- GPU: no
- Others:
Anything else?
No response
I believe lots of users will suffer from this fix since it breaks the compatibility. Especially for the users who are packaging service in Docker in ci/cd, because sometimes they will install towhee model from scratch.
At least, you need to mention this change in the release note such as - !!!Caution: Break compatibility!!!.
This has almost no effect on the outcome. You can test it with the following code, the calculated cosine distances are mostly 1.
from towhee import ops, pipe, DataLoader
import numpy as np
p1 = (
pipe.input('image_path')
.map('image_path', 'image', ops.image_decode.cv2('rgb'))
.map('image', 'vec', ops.image_embedding.timm(model_name='vit_base_patch16_224_in21k').revision('db6727933a'))
.output('vec')
)
p2 = (
pipe.input('image_path')
.map('image_path', 'image', ops.image_decode.cv2('rgb'))
.map('image', 'vec', ops.image_embedding.timm(model_name='vit_base_patch16_224_in21k'))
.output('vec')
)
for image in DataLoader(ops.data_source.glob(YOUR_IMAGE_PATH)):
print(image)
n1 = p1(image).get()[0]
n2 = p2(image).get()[0]
cos = np.dot(n1, n2) / (np.linalg.norm(n1) * np.linalg.norm(n2) )
print(cos)
Thanks for providing the hint cosine distance
. I saw you were using ops.image_embedding.timm
and I was using AutoPipes.pipeline
. Will I need to normalize vector by np.linalg.norm
myself for AutoPipes.pipeline
as well ? Will AutoPipes.pipeline
help me to automatically normalize the embedding ? I was not using np.linalg.norm
currently for the return of AutoPipes.pipeline
.
The pipeline returns vectors generated by the model without normalization.
An example of normalizing.
from towhee import ops, pipe, DataLoader, AutoConfig, AutoPipes
conf = AutoConfig.load_config('image-embedding')
conf.model_name = 'vit_base_patch16_224_in21k'
image_embedding_pipeline = AutoPipes.pipeline('image-embedding', conf)
embedding_with_noramlize = (
pipe.input('image_path')
.map('image_path', 'vec', image_embedding_pipeline)
.map('vec', 'vec', ops.towhee.np_normalize())
.output('vec')
)
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Close the stale issues and pull requests after 7 days of inactivity. Reopen the issue with /reopen
.