current code for `colpali1.2` is not runnable/trainable
CommissarSilver opened this issue · 2 comments
Hi.
The current codebase provided for training/fine-tuning ColPali1.2 on hard negatives is not executable and raises multiple errors along the way:
Running
USE_LOCAL_DATASET=0 python scripts/train/train_colbert.py scripts/configs/pali/train_colpali_docmatix_hardneg_model.yaml
requires the dataloader in load_docmatix_ir_negs to load the Docmatix dataset from HuggingFace as the anchor dataset, which raises the error:
ValueError: Config name is missing. Please pick one among the available configs: ['images', 'pdf', 'zero-shot-exp'] Example of usage: load_dataset('HuggingFaceM4/Docmatix', 'images')
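For reference, the error message itself shows the expected call shape: a config name has to be passed as the second argument to load_dataset. Below is a minimal sketch of doing that explicitly. The helper names pick_config and load_docmatix are hypothetical and not part of the repository, and the "train" split is an assumption; the config names are copied from the error message.

```python
from typing import Optional, Sequence

# Config names copied from the ValueError message above.
DOCMATIX_CONFIGS = ("images", "pdf", "zero-shot-exp")


def pick_config(requested: Optional[str] = None,
                available: Sequence[str] = DOCMATIX_CONFIGS,
                default: str = "images") -> str:
    """Return a valid config name, falling back to `default` when none is given."""
    name = default if requested is None else requested
    if name not in available:
        raise ValueError(f"Unknown config {name!r}; expected one of {list(available)}")
    return name


def load_docmatix(config: Optional[str] = None, split: str = "train"):
    """Hypothetical helper: load Docmatix with an explicit config name."""
    # Imported lazily so pick_config can be used without `datasets` installed.
    from datasets import load_dataset
    # The split name is an assumption; adjust it to the split you need.
    return load_dataset("HuggingFaceM4/Docmatix", pick_config(config), split=split)
```

Calling load_docmatix() with no arguments would then request the images config, which is the subset the next step of the issue relies on.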
If this issue is resolved and the dataset_transformation function is set to load the images subset of the dataset, an error is raised during the initialization of the HardNegCollator, which does not accept a tokenizer as an argument even though one is passed to it in trainer/colmodel_training.py:
TypeError: HardNegCollator.__init__() got an unexpected keyword argument 'tokenizer'
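To illustrate this kind of mismatch in isolation, here is a small sketch, not the repository's code. StubCollator is a hypothetical stand-in whose constructor, like the one described above, has no tokenizer parameter, and build_with_supported_kwargs is a hypothetical helper that drops keyword arguments the constructor does not declare.

```python
import inspect


class StubCollator:
    """Hypothetical stand-in for a collator whose __init__ has no `tokenizer` parameter."""

    def __init__(self, processor, max_length=2048):
        self.processor = processor
        self.max_length = max_length


def build_with_supported_kwargs(cls, **kwargs):
    """Instantiate `cls`, dropping keyword arguments its __init__ does not declare."""
    params = inspect.signature(cls.__init__).parameters
    # If __init__ takes **kwargs, everything is accepted and nothing is dropped.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return cls(**kwargs)
    accepted = {k: v for k, v in kwargs.items() if k in params}
    return cls(**accepted)


# Passing `tokenizer` directly reproduces the same class of TypeError.
try:
    StubCollator(processor="proc", tokenizer="tok")
except TypeError as err:
    print(f"TypeError: {err}")

# Filtering first lets construction succeed without the unsupported argument.
collator = build_with_supported_kwargs(StubCollator, processor="proc", tokenizer="tok")
```

Silently dropping arguments can hide a real interface mismatch, so this is only a way to get past the constructor while debugging, not a fix for the training script.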
After removing the tokenizer argument passed to the collator's init function, another error is raised when the collator itself is called during training. The __call__ function of HardNegCollator is supposed to return the image from an example by accessing the gold_index attribute, which does not exist in the datasets that are loaded (neither docmatix-ir nor Docmatix). This error cannot be resolved on the user side, because no such attribute exists in the datasets.
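A quick way to confirm this before starting a training run is to inspect the first row of the loaded dataset for the field the collator expects. The sketch below assumes rows behave like dictionaries (as Hugging Face dataset rows do); missing_fields and check_first_example are hypothetical helper names, and gold_index is the only field name taken from the description above.

```python
from typing import Iterable, List, Mapping


def missing_fields(example: Mapping, required: Iterable[str]) -> List[str]:
    """Return the required field names that are absent from one example."""
    return [name for name in required if name not in example]


def check_first_example(dataset, required=("gold_index",)):
    """Fail early with a readable message if the first row lacks a required field."""
    first = dataset[0]
    missing = missing_fields(first, required)
    if missing:
        raise KeyError(
            f"Dataset rows lack required field(s) {missing}; found {sorted(first)}"
        )
```

Running check_first_example on the anchor and negatives datasets before building the trainer would surface the missing field immediately instead of at the first collator call.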
Can you please provide the code and the datasets that you used for fine-tuning your model on hard negatives or help with resolving these issues? If that is not possible, I would appreciate it if you can provide instructions on how to fine-tune your model on a custom dataset of hard negatives.
Thank you for your time!
Think it's fixed in the qwen2 branch, which will be merged ASAP!
Merged !