monatis/clip.cpp

[ZSL] Results don't match the Hugging Face demo

ismailmaj opened this issue · 5 comments

./bin/zsl -m ../../laion_clip-vit-b-32-laion2b-s34b-b79k.ggmlv0.f16.bin --image  ../pic.png --text "playing music" --text "playing sports"
clip_model_load: loading model from '../../laion_clip-vit-b-32-laion2b-s34b-b79k.ggmlv0.f16.bin' - please wait....................................................clip_model_load: model size =   288.93 MB / num tensors = 397
clip_model_load: model loaded

playing music = 0.5308
playing sports = 0.4692

Expected results:
playing music = 1.000
playing sports = 0.000
https://huggingface.co/laion/CLIP-ViT-B-32-laion2B-s34B-b79K
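For reference, here is a minimal sketch (not part of clip.cpp; it assumes the Python transformers, torch, and Pillow packages are installed) of how I reproduce the demo-style scores with the HuggingFace implementation of the same checkpoint, so the two can be compared on the same image. pic.png is the same file passed to zsl above.

# Minimal comparison sketch against the HuggingFace reference implementation.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model_id = "laion/CLIP-ViT-B-32-laion2B-s34B-b79K"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

texts = ["playing music", "playing sports"]
inputs = processor(text=texts, images=Image.open("pic.png"), return_tensors="pt", padding=True)
with torch.no_grad():
    # logits_per_image has shape (1, num_texts); softmax gives the zero-shot probabilities
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
for label, p in zip(texts, probs.tolist()):
    print(f"{label} = {p:.4f}")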

Seconded. I was about to post a similar issue.

The results are inaccurate a lot of the time. On some images it even gives inverted results, classifying X as Y and Y as X. It's not clear why this is happening.

This library has great potential, so any help is much appreciated, @monatis. Thank you!

I think it's because the tokenization strategy is different from the HuggingFace CLIP tokenizer.
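If it helps with debugging, one quick check (a sketch, assuming the Python transformers package) is to dump the token ids produced by the HuggingFace CLIP tokenizer for the same prompts and compare them with whatever ids clip.cpp produces:

# Print reference token ids (including the <|startoftext|>/<|endoftext|> specials)
# for comparison with clip.cpp's tokenizer output.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("laion/CLIP-ViT-B-32-laion2B-s34B-b79K")
for prompt in ["playing music", "playing sports"]:
    print(prompt, tokenizer(prompt)["input_ids"])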

Fixed in #56

Thanks for the fix, @monatis! However, I'm still getting inaccurate results. For example, when trying to determine whether a photo shows a man or a woman, it almost always classifies women as men. Strangely, in some cases the score for the text "man" is even higher for images of women than for images of men! Please take a look at the examples below:

Expectation:

[image: 27.jpg]

Result:

$ ./build/bin/zsl -m ./ggml_openai_clip-vit-base-patch32/openai_clip-vit-base-patch32.ggmlv0.f16.bin --text woman --text man --image ./img/27.jpg

man = 0.9785
woman = 0.0215

Expectation:

[image: 29.jpg]

Result:

$ ./build/bin/zsl -m ./ggml_openai_clip-vit-base-patch32/openai_clip-vit-base-patch32.ggmlv0.f16.bin --text woman --text man --image ./img/29.jpg

man = 0.9889
woman = 0.0111

Expectation:

[image: 32.jpg]

Result:

$ ./build/bin/zsl -m ./ggml_openai_clip-vit-base-patch32/openai_clip-vit-base-patch32.ggmlv0.f16.bin --text woman --text man --image ./img/32.jpg

man = 0.9860
woman = 0.0140

As you can see in this example, the photo of the man got 0.9785 as the score for the text "man", while the two photos of women got 0.9889 and 0.9860, which is very weird.
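One way to narrow this down (a sketch, not a definitive test; it assumes the Python transformers, torch, and Pillow packages) is to run the original openai/clip-vit-base-patch32 checkpoint on the same images with the same bare labels. If the reference implementation shows the same bias, the problem lies with the model and prompts rather than with the ggml conversion; if it doesn't, something is still off in clip.cpp.

# Reference run of the original checkpoint on one of the images above
# (img/27.jpg is the same file used with zsl).
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model_id = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

labels = ["woman", "man"]
inputs = processor(text=labels, images=Image.open("img/27.jpg"), return_tensors="pt", padding=True)
with torch.no_grad():
    probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
print({label: round(p, 4) for label, p in zip(labels, probs.tolist())})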

z3ugma commented

I will echo that I'm experiencing the same kind of bias when calling zsl with the classes man and woman; it overwhelmingly predicts man using CLIP-ViT-B-32-laion2B-s34B-b79K f16.