UKPLab/gpl

How to reproduce QGen?

Opened this issue · 0 comments

xMHW commented

Using the GPL-generated data that you distributed, I want to reproduce the QGen result.

python -m gpl.train \
    --path_to_generated_data "generated/$dataset" \
    --base_ckpt "distilbert-base-uncased" \
    --gpl_score_function "dot" \
    --batch_size_gpl 32 \
    --gpl_steps 140000 \
    --new_size -1 \
    --queries_per_passage -1 \
    --output_dir "output/$dataset" \
    --evaluation_data "./$dataset" \
    --evaluation_output "evaluation/$dataset" \
    --generator "BeIR/query-gen-msmarco-t5-base-v1" \
    --retrievers "msmarco-distilbert-base-v3" "msmarco-MiniLM-L-6-v3" \
    --retriever_score_functions "cos_sim" "cos_sim" \
    --cross_encoder "cross-encoder/ms-marco-MiniLM-L-6-v2" \
    --qgen_prefix "qgen" \
    --do_evaluation \
    # --use_amp   # Use this for efficient training if the machine supports AMP

# One can run `python -m gpl.train --help` for the information of all the arguments
# To reproduce the experiments in the paper, set `base_ckpt` to "GPL/msmarco-distilbert-margin-mse" (https://huggingface.co/GPL/msmarco-distilbert-margin-mse)

Starting from this command, I added the mnrl_output_dir and mnrl_evaluation_output parameters (e.g. "output/$dataset-mnrl"); see the sketch below.
In mnrl.py, the batch size is already set to 75 and the number of epochs to 1. Is this the right way to reproduce your reported QGen score?
Also, should I change the base_ckpt argument to "GPL/msmarco-distilbert-margin-mse" for the reproduction?
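
For reference, this is roughly the full invocation after my change. The only additions relative to the command above are the two mnrl_* arguments; I'm assuming the command-line flags simply mirror the parameter names from --help, and the output/evaluation paths are just placeholders I picked:

python -m gpl.train \
    --path_to_generated_data "generated/$dataset" \
    --base_ckpt "distilbert-base-uncased" \
    --gpl_score_function "dot" \
    --batch_size_gpl 32 \
    --gpl_steps 140000 \
    --new_size -1 \
    --queries_per_passage -1 \
    --output_dir "output/$dataset" \
    --mnrl_output_dir "output/$dataset-mnrl" \
    --mnrl_evaluation_output "evaluation/$dataset-mnrl" \
    --evaluation_data "./$dataset" \
    --evaluation_output "evaluation/$dataset" \
    --generator "BeIR/query-gen-msmarco-t5-base-v1" \
    --retrievers "msmarco-distilbert-base-v3" "msmarco-MiniLM-L-6-v3" \
    --retriever_score_functions "cos_sim" "cos_sim" \
    --cross_encoder "cross-encoder/ms-marco-MiniLM-L-6-v2" \
    --qgen_prefix "qgen" \
    --do_evaluation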

If this is right, then I'm afraid I wasn't able to get a result close to the one reported in the paper (28.7 for FiQA); I only got 26.04.