Training for any language
GriffithV opened this issue · 5 comments
Is it trainable for other languages?
Yes, the underlying language model is largely multilingual, so zero-shot capabilities in other languages work out of the box, but they can be improved by training on language-specific data!
Hi,
Along the same lines, do you plan to share the "French" weights in the future?
This is a reference to the ablation study section: "we add 1552 samples representing french..."
I'm planning to fine-tune it on French. It is quite a lot of work, which I find interesting, but it would not be justified if the weights are going to be shared!
Thanks and great work ;)
Yeah, we plan on doing a better training run than the one in the paper!
But the latest version of ColPali (v1.2) is already ~8% better than v1 on French tasks thanks to better training (longer warmup).
I can retrain the one with the TabFQuAD samples as well, but otherwise I was thinking of training a French/English one soon, if you can wait a week or two! Always down to have more data if you do construct a dataset!
Great news, thanks for the answer.
We will not have a dataset ready this quickly, but we are very interested in building one.
Here you go: https://huggingface.co/datasets/vidore/colpali_train_set
This is the original dataset, but it gives a good idea of our data composition. The French data will not be released for the moment, but we will release some models.
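
For anyone who wants to inspect the data composition of the dataset linked above, here is a minimal sketch using the Hugging Face `datasets` library in streaming mode. It is not from the maintainers. The split name "train" and the column name "source" are assumptions to check against the dataset card, and `count_by_field` and `stream_rows` are hypothetical helper names.

```python
from collections import Counter
from itertools import islice

DATASET_ID = "vidore/colpali_train_set"  # repo id taken from the link above


def count_by_field(rows, field, limit=None):
    """Count how often each value of `field` occurs in an iterable of dict rows.

    Rows that lack the field are counted under "<missing>".
    `limit` caps how many rows are read (None reads all of them).
    """
    counts = Counter()
    for row in islice(rows, limit):
        counts[str(row.get(field, "<missing>"))] += 1
    return counts


def stream_rows(split="train"):
    """Stream rows from the Hub instead of downloading everything first.

    Needs `pip install datasets` and network access.
    ASSUMPTION: the split is called "train"; check the dataset card.
    """
    # Imported lazily so count_by_field stays usable without `datasets`.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split=split, streaming=True)


# Example usage (needs network access; "source" is an ASSUMED column name,
# so print one row's keys first to see which columns actually exist):
#
#   rows = stream_rows()
#   print(count_by_field(rows, "source", limit=1000))
```

Streaming keeps the check cheap: only the first `limit` rows are fetched, which should be enough to get a rough view of which sources make up the training mix.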