slight diff with CLIP image branch?
jmhessel opened this issue · 4 comments
Hi!
Thanks for de-jitting the model :)
I wanted to give a quick heads-up that the visual branch gives different results depending on whether you use your version or the original repo. I don't have time to debug this right now, but the outputs of model.encode_image on the CLIP.png file differ between the jit version and this version:
the jit version's first few values are:
3.1397e-01, -1.4601e-01, 3.0359e-01, ...
this version's first few values, using the example code, are:
3.2118e-02, -3.7777e-02, 2.4319e-01, ...
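A quick way to quantify a mismatch like this is to take the largest element-wise difference between the two outputs. The sketch below uses only the three values quoted above; `max_abs_diff`, `roughly_equal`, and the tolerance `atol=1e-2` are hypothetical names and an assumed threshold chosen for illustration, not part of either codebase.

```python
# First three embedding values as quoted in this issue.
jit_values = [3.1397e-01, -1.4601e-01, 3.0359e-01]
dejit_values = [3.2118e-02, -3.7777e-02, 2.4319e-01]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return max(abs(x - y) for x, y in zip(a, b))


def roughly_equal(a, b, atol=1e-2):
    """True if every pair of elements differs by at most atol (an assumed tolerance)."""
    return max_abs_diff(a, b) <= atol


print(max_abs_diff(jit_values, dejit_values))
print(roughly_equal(jit_values, dejit_values))  # False
```

A gap this large on the very first element points to a logic difference rather than ordinary floating-point noise.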
FWIW, I de-jitted just the top level computation, and was able to get an exact match on both cpu and gpu, but only when doing very specific float32/float16 conversion. Specifically, at least on the image side, I kept everything in float32, except for these two variables:
    if device == 'cuda':
        self.visual.positional_embedding = self.visual.positional_embedding.to(dtype=torch.float16).to(device='cuda')
        self.visual.class_embedding = self.visual.class_embedding.to(dtype=torch.float16).to(device='cuda')
Not sure if this is helpful, but figured I'd mention. Thanks for doing the de-jit!!
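To illustrate why an exact match can hinge on which values pass through float16, here is a minimal sketch of a float16 round trip using only the standard library (`struct` format code `"e"` is IEEE 754 half precision). It is not CLIP code; `to_float16` is a hypothetical helper, and the input is simply the first value quoted earlier in this issue.

```python
import struct


def to_float16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


value = 3.1397e-01
half = to_float16(value)

# The round trip changes the value slightly; in [0.25, 0.5) adjacent
# float16 values are 2**-12 apart, so the error is at most 2**-13.
print(half, abs(half - value))
```

Any parameter that is stored in float16 in one implementation and float32 in the other picks up a small rounding difference of this kind, which can then propagate through the rest of the computation.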
Jack
Thank you for your detailed observation. I will dig into the code today.
The mistake was caused by the reshape ops in the visual branch. The output is now 0.3147, -0.1464, 0.3035, -0.1992, 0.0151, ..., and I think the bug has been fixed. Thank you for pointing out my mistake!!! It makes the code more robust :-)
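The thread does not say exactly what was wrong with the reshape ops. One common way such bugs arise, shown here purely as an assumed illustration, is using a reshape where an axis swap (permute/transpose) was intended. Both produce the same shape but order the elements differently. The sketch uses plain nested lists; `reshape` and `transpose` are hypothetical helpers, and the same distinction applies to tensor reshape versus permute in PyTorch.

```python
def reshape(flat, rows, cols):
    """Split a flat list into rows x cols in row-major order (what a reshape does)."""
    if len(flat) != rows * cols:
        raise ValueError("size mismatch")
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


def transpose(matrix):
    """Swap the two axes of a nested-list matrix (what a permute/transpose does)."""
    return [list(col) for col in zip(*matrix)]


# A toy 2 x 3 "feature map": 2 channels, 3 spatial positions.
features = [[1, 2, 3],
            [4, 5, 6]]
flat = [v for row in features for v in row]

reshaped = reshape(flat, 3, 2)     # [[1, 2], [3, 4], [5, 6]]
transposed = transpose(features)   # [[1, 4], [2, 5], [3, 6]]
print(reshaped == transposed)      # False
```

Because the shapes agree, a mix-up like this raises no error and only shows up as wrong output values.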
Hope you find this codebase useful :-)
Awesome!! thanks again for all your work!! :-)
You're welcome!