This is a temporary anonymous repository for the paper "Curriculum Contrastive Context Denoising for Few-shot Conversational Dense Retrieval".
Install dependencies:

```bash
pip install -r requirements.txt
```
We provide the raw and preprocessed versions of the two CAsT datasets in the datasets folder, and the human annotation data in the annotation_data folder. Please note that although the original CAsT-20 dataset contains partial turn dependency annotations, we found them to be neither sufficiently accurate nor complete, so our team refined the original annotations.
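The exact files and schema inside annotation_data are best inspected directly; purely as a hypothetical illustration (the file name and JSON structure below are assumptions, not the repository's actual format), turn dependency annotations of this kind can be consumed as follows:

```python
import json

# Hypothetical schema: maps each turn id to the ids of earlier turns it depends on.
# Replace the file name with an actual file from the annotation_data folder.
with open("annotation_data/cast20_turn_dependencies.json") as f:
    dependencies = json.load(f)

for turn_id, depends_on in dependencies.items():
    print(f"Turn {turn_id} depends on turns {depends_on}")
```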
- train.py: curriculum_sampling, two-step multi-task learning
- test.py: testing with Faiss (see the retrieval sketch after this list)
- my_utils.py: utility functions
- models.py: CQE model architecture (i.e., ANCE)
- db_lib.py: data structures, conversational data augmentation, curriculum_sampling
- running scripts:
  - train_cast19.sh
  - test_cast19.sh
  - train_cast20.sh
  - test_cast20.sh
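For orientation, dense retrieval with Faiss boils down to adding passage embeddings to an index and searching it with query embeddings. The following is a minimal sketch with random toy vectors, not the exact logic of test.py (which may shard the index or use a different index type):

```python
import faiss
import numpy as np

dim = 768  # ANCE produces 768-dimensional embeddings scored by inner product
passage_emb = np.random.rand(1000, dim).astype("float32")  # toy passage embeddings
query_emb = np.random.rand(4, dim).astype("float32")       # toy query embeddings

index = faiss.IndexFlatIP(dim)  # exact (brute-force) inner-product search
index.add(passage_emb)

scores, passage_ids = index.search(query_emb, 10)  # top-10 passages per query
print(passage_ids.shape)  # (4, 10)
```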
First, download the public pre-trained ANCE checkpoints to the checkpoints folder.

```bash
mkdir checkpoints
wget https://webdatamltrainingdiag842.blob.core.windows.net/semistructstore/OpenSource/Passage_ANCE_FirstP_Checkpoint.zip
wget https://data.thunlp.org/convdr/ad-hoc-ance-orquac.cp
unzip Passage_ANCE_FirstP_Checkpoint.zip
mv "Passage ANCE(FirstP) Checkpoint" ad-hoc-ance-msmarco
```
To train our COTED model, run the following scripts.

```bash
# params: training_epoch, aug_ratio, loss_weight

# CAsT-19
bash train_cast19.sh 6 2 0.01

# CAsT-20
bash train_cast20.sh 6 3 0.01
```
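The three positional arguments make it easy to sweep hyperparameters; the loop below is only a convenience sketch around the commands above, assuming it is run from the repository root:

```python
import subprocess

# Sweep the augmentation ratio on CAsT-19 with the other hyperparameters fixed.
# Positional args, per the scripts above: training_epoch, aug_ratio, loss_weight.
for aug_ratio in (1, 2, 3):
    subprocess.run(
        ["bash", "train_cast19.sh", "6", str(aug_ratio), "0.01"],
        check=True,  # abort the sweep if a run fails
    )
```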
For testing, you should first generate the passage embeddings:

```bash
python gen_tokenized_doc.py --config=gen_tokenized_doc.toml
python gen_doc_embedding.py --config=gen_doc_embedding.toml
```
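Before moving on, you can check that the embeddings landed in the expected location (the path is the one the test scripts read from; the directory contents depend on the generation scripts):

```python
from pathlib import Path

emb_dir = Path("./datasets/collections/cast_shared/passage_embeddings")
files = sorted(emb_dir.iterdir()) if emb_dir.is_dir() else []
assert files, f"No embedding files under {emb_dir}; run the two gen_* scripts first."
print(f"Found {len(files)} files, e.g., {files[0].name}")
```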
The passage embeddings are expected to be stored at ./datasets/collections/cast_shared/passage_embeddings. Then, run the following scripts for testing.
```bash
# param: test_epoch

# CAsT-19
bash test_cast19.sh 6
# or
bash test_cast19.sh final

# CAsT-20
bash test_cast20.sh 6
# or
bash test_cast20.sh final
```