ColBERT-X is a generalization of ColBERT for cross-language retrieval.
ColBERT-X can be trained in two ways:
- Zero-Shot (ZS), using English MS MARCO triples, and
- Translate-Train (TT), using translated MS MARCO triples.

The command for training is shown below:
```bash
CUDA_VISIBLE_DEVICES="0,1,2,3" \
python -m torch.distributed.run --nproc_per_node=4 -m \
xlmr_colbert.train --amp --doc_maxlen 180 --bsize 128 --accum 1 \
--triples /path/to/MSMARCO/triples.train.small.tsv --maxsteps 200000 \
--root /root/to/experiments/ --experiment MSMARCO-CLIR --similarity l2 --run msmarco.clir.l2
```
Detailed instructions for inference and pseudo-relevance feedback (PRF) are coming soon!
Here we list the differences between the ColBERT v1 codebase and our code:
- Changed the model prefix from `bert` to `roberta` (relevant issue here). This is necessary because with an incorrect model prefix the pretrained weights are not loaded, and the model is instead initialized from scratch; see the first sketch after this list.
- The `<PAD>` token id is 0 for the BERT tokenizer and 1 for the RoBERTa tokenizer (relevant line here); see the second sketch after this list.
- The RoBERTa tokenizer does not include BERT's additional `[unused]` tokens in its vocabulary, so they have to be added manually and the token embeddings resized (reference); see the third sketch after this list.
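
To make the first point concrete, here is a minimal sketch against the Hugging Face `transformers` API (not code from this repo) showing the `base_model_prefix` attribute that pretrained checkpoint keys are matched against; a model class carrying the wrong prefix matches no keys and is left randomly initialized:

```python
from transformers import BertModel, XLMRobertaModel

# Checkpoint keys are namespaced under this class attribute. If a
# subclass keeps the wrong prefix, no keys match and its weights stay
# randomly initialized instead of being loaded from the checkpoint.
print(BertModel.base_model_prefix)        # "bert"
print(XLMRobertaModel.base_model_prefix)  # "roberta"
```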
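The pad-token difference can likewise be verified directly (again a sketch using `transformers`, with the standard Hub model names assumed):

```python
from transformers import AutoTokenizer

# BERT pads with id 0; XLM-RoBERTa pads with id 1. Any code that
# hard-codes 0 as the padding id breaks under the RoBERTa tokenizer.
print(AutoTokenizer.from_pretrained("bert-base-uncased").pad_token_id)  # 0
print(AutoTokenizer.from_pretrained("xlm-roberta-base").pad_token_id)   # 1
```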
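Finally, a sketch of registering the marker tokens and resizing the embeddings. The exact token strings are an assumption here, based on ColBERT v1 using `[unused0]`/`[unused1]` as its query/document markers:

```python
from transformers import XLMRobertaModel, XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained("xlm-roberta-base")
model = XLMRobertaModel.from_pretrained("xlm-roberta-base")

# RoBERTa vocabularies ship without BERT's [unused*] slots, so the
# query/document marker tokens must be added explicitly ...
tokenizer.add_tokens(["[unused0]", "[unused1]"])  # assumed marker names

# ... and the embedding matrix grown to cover the new token ids.
model.resize_token_embeddings(len(tokenizer))
```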