illuin-tech/colpali

Training for any language


Is it trainable for other languages?

Yes. The underlying language model is largely multilingual, so zero-shot capabilities in other languages work out of the box, but they can be improved by training on language-specific data!

Hi,

Along the same lines, do you plan to share the "French" weights in the future?
This refers to the ablation study section: "we add 1552 samples representing french..."

I'm planning to fine-tune it on French. It is quite a lot of work, which I find interesting, but it is not justified if the weights are going to be shared!

Thanks and great work ;)

Yeah, we plan on doing a better training run than the one in the paper!
But the current latest version of ColPali (v1.2) is already ~8% better than v1 on French tasks thanks to better training (longer warmup).
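
For readers unfamiliar with the term, "warmup" here presumably means ramping the learning rate up gradually over the first training steps. Below is a minimal sketch of a linear warmup followed by a constant rate. The function name, the base learning rate, and the step counts are illustrative assumptions, not the actual ColPali training settings.

```python
def lr_with_linear_warmup(step, base_lr, warmup_steps):
    """Learning rate at a 0-indexed step: linear ramp up to base_lr, then constant.

    Hypothetical helper for illustration only; not taken from the ColPali code.
    """
    if warmup_steps <= 0:
        # No warmup requested: use the base learning rate from the first step.
        return base_lr
    # Fraction of the warmup completed, capped at 1.0 once warmup is over.
    return base_lr * min(1.0, (step + 1) / warmup_steps)


# Compare a shorter and a longer warmup (all values are made up).
short_warmup = [lr_with_linear_warmup(s, 5e-5, 100) for s in range(1000)]
long_warmup = [lr_with_linear_warmup(s, 5e-5, 500) for s in range(1000)]
```

With the longer warmup, the learning rate stays lower for more of the early steps and reaches the same base value later.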

I can retrain the one with the TabFQuAD samples as well, but I was thinking of training a French/English one soon, if you can wait a week or two! Always happy to have more data if you do build a dataset!

Great news, thanks for the answer.

We will not have a dataset ready that quickly, but we are very interested in building one.

Here you go: https://huggingface.co/datasets/vidore/colpali_train_set
This is the original dataset, but it gives a good idea of our data composition. The French data will not be released for the moment, but we will release some models.