Training for any language

Question

GriffithV opened this issue 5 months ago · 5 comments

Is it trainable for other languages?

Answer 1 · 2024-08-01T17:55:23.000Z

Yes. The underlying language model is largely multilingual, so zero-shot capabilities in other languages work out of the box, but they can be improved by training on language-specific data!

Answer 2 · 2024-09-03T15:09:55.000Z

Hi,

Along the same lines, do you plan to share the "French" weights in the future?
This refers to the ablation study section: "we add 1552 samples representing french..."

I'm planning to fine-tune it on French. It is quite a lot of work, which I find interesting, but it would not be justified if the weights are going to be shared!

Thanks and great work ;)

Answer 3 · 2024-09-03T19:36:08.000Z

Yeah, we plan on doing a better training run than the one in the paper!
But the latest version of ColPali (v1.2) is already ~8% better than v1 on French tasks thanks to better training (longer warmup).

I can retrain the one with the TabFQuAD samples as well, but otherwise, if you can wait a week or two, I was thinking of training a French/English one soon! Always happy to have more data if you do build a dataset!

Answer 4 · 2024-09-04T07:43:37.000Z

Great news, thanks for the answer.

We will not have a dataset ready that quickly, but we are very interested in building one.

Answer 5 · 2024-09-04T15:20:39.000Z

Here you go: https://huggingface.co/datasets/vidore/colpali_train_set
This is the original dataset, but it gives a good idea of our data composition. The French data will not be released for the moment, but we will release some models.
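
For anyone who wants to look at that composition, here is a minimal sketch using the Hugging Face `datasets` library. The helper names `load_train_set` and `composition`, the `"train"` split, and the `"source"` column are assumptions made for illustration, not something confirmed in this thread, so check the dataset card before relying on them.

```python
from collections import Counter
from typing import Iterable


def load_train_set(split: str = "train"):
    """Download the dataset from the Hugging Face Hub (needs network access).

    The split name "train" is an assumption; check the dataset card.
    """
    # Imported lazily so `composition` works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("vidore/colpali_train_set", split=split)


def composition(values: Iterable[str]) -> dict[str, float]:
    """Return the fraction of rows per distinct value, most frequent first."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: count / total for name, count in counts.most_common()}


# Usage (downloads data, so it is left commented out):
# ds = load_train_set()
# # "source" is an assumed column name; reading a single column
# # avoids decoding every page image.
# for name, share in composition(ds["source"]).items():
#     print(f"{name}: {share:.1%}")
```

Reading one column instead of iterating over full rows keeps the check cheap, since the rows also carry page images.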