illuin-tech/colpali

Generalize the training configuration

efenocchi opened this issue · 3 comments

Hi guys, great project!
I started running some fine-tuning tests on private datasets and on the 4 datasets you suggested in this issue.

My question is related to the configuration files. I noticed that with PaliGemma you trained both language_model and custom_text_proj, while with Idefics2 you only trained the text_model. Is there a reason you chose to train more parts in one model than in the other?
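
To make my question concrete, this is roughly the kind of choice I mean — a hypothetical peft-style sketch with illustrative values, not your actual config file:

```python
from peft import LoraConfig

# Hypothetical sketch: choosing which submodules get trained.
# The regex and module names below are illustrative, not the repo's real config.
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    lora_dropout=0.1,
    init_lora_weights="gaussian",
    # Attach LoRA adapters only to the language model's attention projections...
    target_modules=r".*(language_model).*(q_proj|k_proj|v_proj|o_proj)$",
    # ...and optionally train the projection head fully instead of leaving it
    # at its random init.
    modules_to_save=["custom_text_proj"],
)
```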

If I switch to a different model (like Idefics3 or others), is there a way to find the best configurations, or did you rely solely on the final benchmarks?

Last question: to train in a distributed setup like you did, do you recommend changing any of the configurations?

Thanks for your time!

I also take this opportunity to ask if you have any advice for generating the synthetic dataset part: "Our training dataset of 127,460 query-page pairs is comprised of train sets of openly available academic datasets (63%) and a synthetic dataset made up of pages from web-crawled PDF documents and augmented with VLM-generated (Claude-3 Sonnet) pseudo-questions (37%)."

Hey! We'll be releasing the data as soon as I have a bit of time! Our synthetic data is basically generated by giving Claude Sonnet an image of a page and asking it to produce queries that could be (partially or fully) answered with information contained on the page. We're experimenting with different generation processes at the moment.
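
Concretely, the generation call can be as simple as something like this — a minimal sketch assuming the Anthropic Python SDK, with an illustrative prompt (as said above, we're still iterating on the actual generation process):

```python
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def generate_pseudo_queries(image_path: str, n_queries: int = 3) -> str:
    """Ask Claude 3 Sonnet for retrieval queries answerable from a single page image."""
    with open(image_path, "rb") as f:
        image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64", "media_type": "image/png", "data": image_b64}},
                {"type": "text",
                 "text": f"Generate {n_queries} search queries that could be partially or "
                         "fully answered using only the information on this page. "
                         "Return one query per line."},
            ],
        }],
    )
    return response.content[0].text
```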

The Docmatix dataset is also a good resource for understanding synthetic query generation. Training solely with Docmatix data, I got to within about 3% of the OG model, so data quality is not (yet) the bottleneck for performance.

Training the text projection doesn't actually help performance that much. It just felt a bit "wrong" to leave it at its random init state, so eventually I decided to train it as well, but there was no real performance difference. There are ablations in the paper about training the vision component if you are interested. For these sorts of changes, I found that monitoring the validation set loss is often sufficient to see whether an architectural change helps, so no need to benchmark everything if you are looking for very quick iterations!
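
For these quick ablations, the pattern is really just flipping requires_grad on the submodules you care about and watching the eval loss — a minimal sketch, with illustrative prefix names:

```python
import torch

# Illustrative prefixes — adjust to the submodule names of the model you use.
TRAINABLE_PREFIXES = ("language_model.", "custom_text_proj.")

def set_trainable(model: torch.nn.Module, prefixes=TRAINABLE_PREFIXES) -> None:
    """Freeze everything except parameters whose names start with the given prefixes."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(prefixes)
```

Then compare variants on validation loss only, and keep the full retrieval benchmarks for the promising ones.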

Lastly, distributed training in our case is on a single node, so we are able to run our contrastive loss against samples held on the other GPUs of the same node. If you go multi-node, you should probably think about whether you want to simulate very large batch sizes or just go for plain parallelism.
On single-GPU setups, the smaller batch sizes probably won't help training, but training with pre-mined hard negatives compensates for this!
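
For reference, the cross-GPU negative pooling I'm describing usually looks like the sketch below — a simplified single-vector InfoNCE example, not the exact late-interaction loss used here, but the gathering pattern is the same idea:

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def gather_with_grad(t: torch.Tensor) -> torch.Tensor:
    """All-gather embeddings from every GPU on the node, keeping gradients
    for the local shard (a common trick to enlarge the in-batch negative pool)."""
    if not (dist.is_available() and dist.is_initialized()):
        return t
    gathered = [torch.zeros_like(t) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, t)
    gathered[dist.get_rank()] = t  # re-insert the local tensor so grads flow
    return torch.cat(gathered, dim=0)

def in_batch_contrastive_loss(query_emb, doc_emb, temperature: float = 0.02):
    # Pool both sides across GPUs so every query sees every document
    # on the node as an in-batch negative.
    q = gather_with_grad(F.normalize(query_emb, dim=-1))
    d = gather_with_grad(F.normalize(doc_emb, dim=-1))
    logits = q @ d.T / temperature
    labels = torch.arange(q.size(0), device=q.device)  # matching pairs sit on the diagonal
    return F.cross_entropy(logits, labels)
```

On a single GPU the gather is a no-op, which is where pre-mined hard negatives help compensate for the smaller negative pool.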

Everything is clear, thanks for the clarification!