How to reproduce QGen?
xMHW commented
Using the GPL-generated data that you distributed, I want to reproduce the QGen results. This is the command I started from:
python -m gpl.train \
    --path_to_generated_data "generated/$dataset" \
    --base_ckpt "distilbert-base-uncased" \
    --gpl_score_function "dot" \
    --batch_size_gpl 32 \
    --gpl_steps 140000 \
    --new_size -1 \
    --queries_per_passage -1 \
    --output_dir "output/$dataset" \
    --evaluation_data "./$dataset" \
    --evaluation_output "evaluation/$dataset" \
    --generator "BeIR/query-gen-msmarco-t5-base-v1" \
    --retrievers "msmarco-distilbert-base-v3" "msmarco-MiniLM-L-6-v3" \
    --retriever_score_functions "cos_sim" "cos_sim" \
    --cross_encoder "cross-encoder/ms-marco-MiniLM-L-6-v2" \
    --qgen_prefix "qgen" \
    --do_evaluation \
    # --use_amp  # Use this for efficient training if the machine supports AMP
# One can run `python -m gpl.train --help` for information about all the arguments
# To reproduce the experiments in the paper, set `base_ckpt` to "GPL/msmarco-distilbert-margin-mse" (https://huggingface.co/GPL/msmarco-distilbert-margin-mse)
Starting from this command, I added the mnrl_output_dir and mnrl_evaluation_output parameters (e.g. "output/$dataset-mnrl").
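To be concrete, I believe my modified run is equivalent to the following, written with the Python entry point (gpl.train). I am assuming here that the two mnrl_* options are accepted as keyword arguments in the same way as the CLI flags; the dataset name and paths are just examples.

import gpl

dataset = "fiqa"  # the BeIR dataset I am testing on

gpl.train(
    path_to_generated_data=f"generated/{dataset}",
    base_ckpt="distilbert-base-uncased",
    gpl_score_function="dot",
    batch_size_gpl=32,
    gpl_steps=140000,
    new_size=-1,
    queries_per_passage=-1,
    output_dir=f"output/{dataset}",
    evaluation_data=f"./{dataset}",
    evaluation_output=f"evaluation/{dataset}",
    generator="BeIR/query-gen-msmarco-t5-base-v1",
    retrievers=["msmarco-distilbert-base-v3", "msmarco-MiniLM-L-6-v3"],
    retriever_score_functions=["cos_sim", "cos_sim"],
    cross_encoder="cross-encoder/ms-marco-MiniLM-L-6-v2",
    qgen_prefix="qgen",
    do_evaluation=True,
    # The two parameters I added (assuming they are accepted as keyword
    # arguments, mirroring the CLI options) to also train and evaluate the
    # QGen/MNRL model:
    mnrl_output_dir=f"output/{dataset}-mnrl",
    mnrl_evaluation_output=f"evaluation/{dataset}-mnrl",
)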
In mnrl.py, the batch size is already set to 75 and num_epochs to 1. Is this the right way to reproduce your reported QGen score?
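For reference, this is my rough understanding of what mnrl.py does, as a minimal sketch with the standard sentence-transformers API (the actual data loading, pooling, and warmup settings in mnrl.py may differ, and the example pairs below are only illustrative):

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Illustrative (generated query, positive passage) pairs; in mnrl.py these
# come from the synthetic queries and their source passages.
train_examples = [
    InputExample(texts=["what is a 401k plan", "A 401(k) is an employer-sponsored retirement savings plan ..."]),
    InputExample(texts=["how do index funds work", "An index fund is a portfolio constructed to track a market index ..."]),
]

model = SentenceTransformer("distilbert-base-uncased")
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=75)  # batch size fixed at 75
train_loss = losses.MultipleNegativesRankingLoss(model)  # MNRL with in-batch negatives

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,  # num_epochs fixed at 1
)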
Also, should I change the base_ckpt argument to "GPL/msmarco-distilbert-margin-mse" to reproduce the paper's results?
If this setup is correct, then I'm afraid I was not able to get a result similar to the one reported in the paper (28.7 for FiQA); I only got 26.04 on FiQA.