illuin-tech/colpali

Current code for `colpali1.2` is not runnable/trainable

CommissarSilver opened this issue · 2 comments

Hi.
The current codebase provided for training/fine-tuning ColPali1.2 on hard negatives is not executable and raises multiple errors along the way:

Running
USE_LOCAL_DATASET=0 python scripts/train/train_colbert.py scripts/configs/pali/train_colpali_docmatix_hardneg_model.yaml

requires the dataloader in load_docmatix_ir_negs to load the Docmatix dataset from HuggingFace as the anchor dataset, which raises the error:

ValueError: Config name is missing. Please pick one among the available configs: ['images', 'pdf', 'zero-shot-exp'] Example of usage: load_dataset('HuggingFaceM4/Docmatix', 'images')
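A minimal sketch of passing the config name explicitly, as the error message suggests. It assumes the `images` subset is the intended one; `load_docmatix` is a hypothetical helper name, not a function in the repository, and the `split` and `streaming` defaults are illustrative choices.

```python
# Sketch only: pass the config name explicitly when loading Docmatix.
# Assumption: the "images" subset is the one the training script needs.
DOCMATIX_REPO = "HuggingFaceM4/Docmatix"
DOCMATIX_CONFIGS = ["images", "pdf", "zero-shot-exp"]  # taken from the error message


def load_docmatix(config_name="images", split="train", streaming=True):
    """Load one Docmatix config (hypothetical helper, not part of the repo)."""
    if config_name not in DOCMATIX_CONFIGS:
        raise ValueError(
            f"Unknown config {config_name!r}; expected one of {DOCMATIX_CONFIGS}"
        )
    # Imported lazily so the validation above works without the library installed.
    from datasets import load_dataset

    # The second positional argument is the config name the ValueError asks for.
    return load_dataset(DOCMATIX_REPO, config_name, split=split, streaming=streaming)
```

Calling `load_docmatix()` would then load the `images` config instead of failing on the missing config name.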

If this issue is resolved by setting the dataset_transformation function to load the images subset of the dataset, the next error is raised during initialization of HardNegCollator, which does not accept a tokenizer argument even though one is passed to it in trainer/colmodel_training.py:
TypeError: HardNegCollator.__init__() got an unexpected keyword argument 'tokenizer'
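One generic way to sidestep this kind of mismatch is to drop keyword arguments that the target `__init__` does not declare. The sketch below uses only the standard library; `ToyHardNegCollator` and its parameters are hypothetical stand-ins, not the real `HardNegCollator` signature.

```python
import inspect


def build_with_supported_kwargs(cls, **kwargs):
    """Instantiate cls, dropping keyword arguments its __init__ does not accept."""
    params = inspect.signature(cls.__init__).parameters
    # If __init__ declares **kwargs, everything can be passed through unchanged.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return cls(**kwargs)
    accepted = {k: v for k, v in kwargs.items() if k in params}
    return cls(**accepted)


class ToyHardNegCollator:
    """Hypothetical stand-in for illustration; not the repository's class."""

    def __init__(self, processor, max_length=2048):
        self.processor = processor
        self.max_length = max_length


# The unsupported 'tokenizer' keyword is silently dropped instead of raising TypeError.
collator = build_with_supported_kwargs(
    ToyHardNegCollator, processor="proc", max_length=50, tokenizer="tok"
)
```

This only hides the mismatch at the call site; whether the collator actually needs a tokenizer is a separate question for the maintainers.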

After removing the tokenizer argument from the collator's initialization, another error is raised when the collator is called during training. The __call__ function of HardNegCollator is supposed to return the image from an example by reading its gold_index field, which does not exist in either of the loaded datasets (neither docmatix-ir nor Docmatix). I could not resolve this error, since no such field exists in the datasets.
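A quick way to confirm the missing field before starting a training run is to compare the dataset's column names against what the collator reads. In this sketch `missing_columns` is a hypothetical helper and the example column list is made up for illustration; with a Hugging Face `Dataset`, the names would come from its `column_names` attribute.

```python
# Sketch only: check that a dataset exposes the fields the collator reads.
REQUIRED_COLUMNS = {"gold_index"}  # the field the collator's __call__ reads


def missing_columns(column_names, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent, sorted for stable output."""
    return sorted(set(required) - set(column_names))


# Hypothetical column list for illustration; not the real Docmatix schema.
example_columns = ["query", "image", "negative_passages"]
print(missing_columns(example_columns))  # -> ['gold_index']
```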

Can you please provide the code and the datasets that you used for fine-tuning your model on hard negatives or help with resolving these issues? If that is not possible, I would appreciate it if you can provide instructions on how to fine-tune your model on a custom dataset of hard negatives.

Thank you for your time!

I think it's fixed in the qwen2 branch, which will be merged ASAP!

Merged!