illuin-tech/colpali

Base Model Linear Projection Layer Weights

contrebande-labs opened this issue · 5 comments

Hi guys,

Are there plans to release the training code showing how the custom_text_proj.bias and custom_text_proj.weight parameter values were obtained, or are they hard-coded arbitrary constants?

Thanks!

All the code is in this repo!
These projection layer values are initialized with the default stochastic initialization from torch and HF.

They are saved in the vidore/colpaligemma-X models, which are otherwise the exact same weights as paligemma-X.

If you initialize ColPali from PaliGemma checkpoints, these layers will be randomly initialized; if you load them from the ColPaliGemma base models, they'll be deterministic.

In both cases you need to add the trained adapters on top if you want the model to work.
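To make the "default stochastic initialization" point concrete, here is a tiny self-contained sketch (the 2048→128 shapes are just placeholders, not necessarily the real dims):

```python
import torch

# The extra projection is just an nn.Linear. When it isn't present in the checkpoint,
# it gets torch's default (stochastic) initialization, so two fresh inits differ.
proj_a = torch.nn.Linear(2048, 128)
proj_b = torch.nn.Linear(2048, 128)
print(torch.allclose(proj_a.weight, proj_b.weight))  # False: two fresh inits differ

# Saving the initialized weights once (what the vidore/colpaligemma-X repos do)
# and reloading them is what makes the layer deterministic afterwards.
proj_c = torch.nn.Linear(2048, 128)
proj_c.load_state_dict(proj_a.state_dict())
print(torch.allclose(proj_a.weight, proj_c.weight))  # True: identical weights
```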

Hi @ManuelFay! If it looks like I'm fumbling in the dark here, it's because I am 😁 I'm trying to understand your code. I know that the custom_text_proj layer is trained within the adapter, but are the initialization weights in the base model (vidore/colpaligemma-3b-pt-448-base) just a fixed random seed so people can replicate your results exactly, or were they heuristically obtained? I'm trying to make a merged fp32 model from the original Google weights so I can make a TensorRT ONNX version (I will release it as-is on HF based on 1.2, and then I will train my own ColPali adapter).
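Concretely, here is roughly what I am trying to do (a rough sketch on my side; the `colpali_engine` import path and the adapter repo ID are assumptions and may differ):

```python
import torch
from peft import PeftModel
from colpali_engine.models import ColPali  # import path may differ across colpali-engine versions

# Load the base (with its saved projection-layer init) in full fp32 precision.
base = ColPali.from_pretrained(
    "vidore/colpaligemma-3b-pt-448-base",
    torch_dtype=torch.float32,
)

# Attach the trained adapter and fold (merge) it into the base weights.
peft_model = PeftModel.from_pretrained(base, "vidore/colpali-v1.2")  # adapter repo ID assumed
merged = peft_model.merge_and_unload()

# Save the merged fp32 checkpoint; the ONNX/TensorRT export would start from this.
merged.save_pretrained("colpali-merged-fp32")
```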

And I just saw that you replied to me in the other thread as well. I will send you an email with the question above so you can reply there. Thank you for your time.

Hey, so I'll try to recap the steps more clearly:

  1. We take Google's PaliGemma checkpoint. We experimented with the mix and pt versions, often at 448 resolution.

  2. In our proposed architecture, there is an extra projection layer compared to Google's version. This layer is initialized stochastically when loading the weights from Google's checkpoints. To guarantee everyone can start from the same weights, we did this initialization once and saved the whole model. These are the vidore/colpaligemma-base models. Due to memory constraints, we push the bf16 weights instead of fp32, but fp32 would work just fine!

  3. We then add LoRA adapters on top of the vidore/colpaligemma-base models. We train them on our data. The trained adapters are exported and pushed to our hub.

  4. To use our adapters optimally and ensure determinism, the best approach is thus to download the base weights with the extra projection layer (vidore/colpaligemma-base) and then add the adapters on top (see the sketch below).

Hope this clears it up a bit more!
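In code, the recommended path looks roughly like this. This is a sketch, not the exact training/inference code: the `colpali_engine` import path and the `vidore/colpali-v1.2` adapter ID are assumptions and may differ depending on the version you use.

```python
import torch
from peft import PeftModel
from colpali_engine.models import ColPali  # import path may differ across colpali-engine versions

# Steps 2/4: load the base that already contains the saved custom_text_proj initialization.
# Loading from google/paligemma-* instead would re-initialize that layer randomly.
base = ColPali.from_pretrained(
    "vidore/colpaligemma-3b-pt-448-base",
    torch_dtype=torch.bfloat16,  # the hub weights are pushed in bf16; fp32 works too
)

# Step 3: add the trained LoRA adapters on top (required for useful embeddings).
model = PeftModel.from_pretrained(base, "vidore/colpali-v1.2")  # adapter repo ID assumed
model.eval()
```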

Hi @ManuelFay! Thanks for taking the time to write this summary. I think it would make a good TL;DR right at the top of your model cards. I think the answer to my main question is in point 2: the layer is initialized stochastically when loading the weights from Google's checkpoints, and you did that initialization once and saved the whole model so everyone can start from the same weights. I interpret this as the answer to where the initialization comes from: it's random, and it is frozen for anybody starting from the colpaligemma base models. One can use the original Google checkpoints, but they would then get randomly initialized values for the linear projection layer.

But then, as I wrote in the email I sent you (I think you did a "reply all" and it also got sent to this GitHub issue thread), I was also asking whether you discovered why not having this layer pre-initialized in the base model could cause the problems that motivated you to include it for the 1.2 revision. I proposed my own theory that it is simply a precision problem: some initialization values are clipped to zero in data types prone to underflow (fp16, but not bf16 or fp8; int8, int4, etc.), so when the adapter values are merged in (multiplied), the resulting merge is way off. I would further propose that people who want to replicate your results should work in fp32 (as I did). That is the best way to "guarantee" reproducibility, since it is the only universally interoperable data type (as in deterministic, without side effects). I for one will continue to load the original Google base and the adapter models at full precision, and will downcast or quantize only the resulting merged weights.
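To illustrate the underflow idea (a quick check; the 2e-8 value is just a representative tiny weight, not an actual ColPali parameter):

```python
import torch

tiny = torch.tensor([2e-8])    # below half of fp16's smallest subnormal (~6e-8)
print(tiny.to(torch.float16))  # tensor([0.], dtype=torch.float16): flushed to zero
print(tiny.to(torch.bfloat16)) # ~2e-8 survives: bf16 keeps fp32's exponent range
print(tiny.to(torch.float32))  # unchanged: fp32 is the reference precision here
```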

In my email, I also asked about ColBERT embeddings at lower dimensions, but I will send a separate email for that (without the GitHub reply-to).