[ZSL] Results don't match the Hugging Face demo
ismailmaj opened this issue · 5 comments
./bin/zsl -m ../../laion_clip-vit-b-32-laion2b-s34b-b79k.ggmlv0.f16.bin --image ../pic.png --text "playing music" --text "playing sports"
clip_model_load: loading model from '../../laion_clip-vit-b-32-laion2b-s34b-b79k.ggmlv0.f16.bin' - please wait....................................................clip_model_load: model size = 288.93 MB / num tensors = 397
clip_model_load: model loaded
playing music = 0.5308
playing sports = 0.4692
Expected results:
playing music = 1.000
playing sports = 0.000
https://huggingface.co/laion/CLIP-ViT-B-32-laion2B-s34B-b79K
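For context on why a near-uniform 0.5308 / 0.4692 split is suspicious: the HF demo L2-normalizes both embeddings, multiplies the cosine similarities by CLIP's learned logit scale (roughly 100 for the released checkpoints), and only then applies softmax. If the scale step is skipped, or if the embeddings themselves are slightly off, typical CLIP cosine similarities (around 0.2–0.3) produce exactly this kind of washed-out distribution. A minimal sketch with hypothetical embedding values:

```python
import numpy as np

def zero_shot_probs(image_emb, text_embs, logit_scale=100.0):
    """Zero-shot scoring as in CLIP: cosine similarity of L2-normalized
    embeddings, scaled by the learned logit scale, then softmax over the
    candidate texts."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = logit_scale * (text_embs @ image_emb)
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()

# Hypothetical unit vectors chosen so the cosine similarities are
# 0.28 and 0.22 -- magnitudes typical for CLIP image/text pairs.
img = np.array([1.0, 0.0])
txts = np.array([[0.28, (1 - 0.28**2) ** 0.5],
                 [0.22, (1 - 0.22**2) ** 0.5]])

print(zero_shot_probs(img, txts, logit_scale=100.0))  # confident split
print(zero_shot_probs(img, txts, logit_scale=1.0))    # washed-out ~0.51/0.49
```

With the scale applied, the same similarities give a near-one-hot answer; without it, you get a split very close to the 0.53 / 0.47 reported above. (The embeddings here are made up purely to illustrate the scale's effect.)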
Seconded. I was about to post a similar issue.
The results are inaccurate much of the time. On some images it even gives inverted results, classifying X as Y and Y as X. It's not clear why this is happening.
This library has great potential, so any help is much appreciated, @monatis. Thank you!
I think it's because the tokenization strategy differs from the Hugging Face CLIP tokenizer.
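For anyone comparing tokenizers: the Hugging Face CLIP tokenizer cleans whitespace, lowercases, pre-tokenizes with a specific regex, then applies BPE merges and wraps the result in <|startoftext|> / <|endoftext|>. If a port skips any of these steps, the token IDs (and hence the text embeddings) diverge. Below is a sketch of just the normalization and pre-tokenization stage, using a simplified ASCII-only version of CLIP's pattern (the real tokenizer uses Unicode categories via the `regex` package, plus ftfy fixing and BPE afterwards):

```python
import re

# Simplified (ASCII-only) stand-in for CLIP's pre-tokenization pattern;
# the real pattern uses \p{L} and \p{N} Unicode classes.
PAT = re.compile(r"'s|'t|'re|'ve|'m|'ll|'d|[a-z]+|[0-9]|[^\sa-z0-9]+")

def clip_pre_tokenize(text):
    """Whitespace cleanup + lowercasing, then regex pre-tokenization."""
    text = re.sub(r"\s+", " ", text).strip().lower()
    return PAT.findall(text)

print(clip_pre_tokenize("Playing   Music!"))  # ['playing', 'music', '!']
print(clip_pre_tokenize("It's"))              # ['it', "'s"]
```

A quick sanity check is to print the token IDs from `transformers.CLIPTokenizer` and from the port for the same prompt and diff them; any mismatch here fully explains wrong zero-shot scores downstream.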
Thanks for the fix, @monatis! However, I'm still getting inaccurate results. For example, when trying to determine if it's a man or a woman, it almost always classifies women as men. Also, strangely enough, in some cases, the score of the text "man" is higher for some images of women than for some images of men! Please take a look at the example below:
Expectation: (screenshot omitted)
Result:
$ ./build/bin/zsl -m ./ggml_openai_clip-vit-base-patch32/openai_clip-vit-base-patch32.ggmlv0.f16.bin --text woman --text man --image ./img/27.jpg
man = 0.9785
woman = 0.0215
Expectation: (screenshot omitted)
Result:
$ ./build/bin/zsl -m ./ggml_openai_clip-vit-base-patch32/openai_clip-vit-base-patch32.ggmlv0.f16.bin --text woman --text man --image ./img/29.jpg
man = 0.9889
woman = 0.0111
Expectation: (screenshot omitted)
Result:
$ ./build/bin/zsl -m ./ggml_openai_clip-vit-base-patch32/openai_clip-vit-base-patch32.ggmlv0.f16.bin --text woman --text man --image ./img/32.jpg
man = 0.9860
woman = 0.0140
As you can see in this example, the photo of the man got 0.9785 as the score for the text "man", while the 2 photos of women got 0.9889 and 0.9860, which is very weird.
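Besides tokenization, the other usual suspect for this kind of systematic drift is image preprocessing: CLIP expects a bicubic resize to 224, a center crop, and normalization with CLIP-specific mean/std values, not the ImageNet ones. A sketch of the normalization step with the constants from OpenAI's preprocessing (open_clip uses the same values for the LAION ViT-B/32 checkpoint, to my knowledge):

```python
import numpy as np

# CLIP normalization constants (per RGB channel).
CLIP_MEAN = np.array([0.48145466, 0.4578275, 0.40821073])
CLIP_STD = np.array([0.26862954, 0.26130258, 0.27577711])

def clip_normalize(image):
    """Normalize a float image in [0, 1] with shape (H, W, 3) the way
    CLIP's preprocessing does, after resize and center crop."""
    return (image - CLIP_MEAN) / CLIP_STD
```

If the port normalizes with different constants (or in a different channel order), every image embedding shifts in a consistent direction, which would produce exactly the kind of systematic man/woman bias described here rather than random noise.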
I will echo that I'm experiencing the same kind of bias when calling zsl with the classes "man" and "woman": it overwhelmingly predicts "man", using CLIP-ViT-B-32-laion2B-s34B-b79K f16.
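One thing worth trying independently of any bug: bare class names like "man" / "woman" are somewhat out of distribution for CLIP's text tower, and the original CLIP paper gets much better zero-shot accuracy by wrapping class names in prompt templates (and ensembling several). A minimal sketch (the template strings are just examples, not the paper's exact set):

```python
# Example prompt templates; the CLIP paper ensembles ~80 of these
# per class and averages the resulting text embeddings.
TEMPLATES = [
    "a photo of a {}.",
    "a portrait of a {}.",
    "a picture of a {}.",
]

def expand_prompts(classname):
    """Turn a bare class name into a list of templated prompts."""
    return [t.format(classname) for t in TEMPLATES]

print(expand_prompts("woman"))
```

Passing `--text "a photo of a man" --text "a photo of a woman"` instead of the bare words may reduce the bias even before the underlying discrepancy is fixed; if the scores stay inverted versus the HF demo with identical prompts, that isolates the problem to the port itself.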