weiyx16/CLIP-pytorch

slight diff with CLIP image branch?

jmhessel opened this issue · 4 comments

Hi!

Thanks for de-jitting the model :)

I wanted to give a quick heads-up that the visual branch gives different results depending on whether you use your version or the original repo. I don't have time to debug this right now, but running model.encode_image on the CLIP.png file gives different outputs for the JIT version vs. this version:

the jit version's first few values are:
3.1397e-01, -1.4601e-01, 3.0359e-01, ...

this version's first few values, using the example code, are:
3.2118e-02, -3.7777e-02, 2.4319e-01, ...
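
Roughly, the comparison looks like this. This is just a sketch: load_dejit_model is a stand-in for however this repo constructs its model, and clip here is the pip package from the original repo.

import torch
from PIL import Image
import clip  # original OpenAI repo, provides the JIT model

device = "cuda" if torch.cuda.is_available() else "cpu"

# Reference: the original JIT model and its preprocessing pipeline.
jit_model, preprocess = clip.load("ViT-B/32", device=device, jit=True)
image = preprocess(Image.open("CLIP.png")).unsqueeze(0).to(device)

with torch.no_grad():
    ref = jit_model.encode_image(image)

# De-jitted model from this repo; load_dejit_model is a placeholder name.
# dejit_model = load_dejit_model("ViT-B/32", device=device)
# with torch.no_grad():
#     out = dejit_model.encode_image(image)

print(ref[0, :3])    # JIT model: 3.1397e-01, -1.4601e-01, 3.0359e-01, ...
# print(out[0, :3])  # de-jitted model: 3.2118e-02, -3.7777e-02, 2.4319e-01, ...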

FWIW, I de-jitted just the top-level computation and was able to get an exact match on both CPU and GPU, but only with a very specific float32/float16 conversion. Specifically, at least on the image side, I kept everything in float32 except for these two variables:

if device == 'cuda':
  # casting just these two embeddings to float16 reproduces the JIT model's numerics
  self.visual.positional_embedding = self.visual.positional_embedding.to(dtype=torch.float16).to(device='cuda')
  self.visual.class_embedding = self.visual.class_embedding.to(dtype=torch.float16).to(device='cuda')
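
With that change, a quick sanity check along these lines (same placeholder names as the sketch above) can confirm the match; torch.allclose with a small tolerance is the safer check in general:

with torch.no_grad():
    ref = jit_model.encode_image(image).float()
    out = dejit_model.encode_image(image).float()

print((ref - out).abs().max())              # should be 0 (or ~1e-7) for a match
print(torch.allclose(ref, out, atol=1e-6))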

Not sure if this is helpful, but figured I'd mention. Thanks for doing the de-jit!!

Jack

Thank you for the detailed observation. I will dig into the code and debug this today.

The mismatch was caused by the reshape ops in the visual branch. The output is now 0.3147, -0.1464, 0.3035, -0.1992, 0.0151, ..., so I believe the bug has been fixed. Thank you for pointing it out, it makes the code more robust!!! :-)
Hope you find this codebase useful :-)
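
For context, the reshape pitfall in question looks roughly like this in a ViT-style visual branch (illustrative sketch of the failure mode, not the exact diff):

import torch

B, width, grid = 1, 768, 7                 # ViT-B/32-ish shapes
x = torch.randn(B, width, grid, grid)      # conv1 output: [B, width, grid, grid]

# Correct: flatten the spatial dims, then permute to [B, grid*grid, width].
ok = x.reshape(B, width, grid * grid).permute(0, 2, 1)

# Wrong: reshaping directly to [B, grid*grid, width] silently scrambles
# the patch/channel layout while keeping the same total size.
bad = x.reshape(B, grid * grid, width)

print(torch.allclose(ok, bad))             # False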

Awesome!! Thanks again for all your work!! :-)

You're welcome!