Reproducing Table 2
eunseongc opened this issue · 4 comments
Dear Authors,
Hello, I am a graduate student studying information retrieval from South Korea.
First of all, thank you for sharing your great work.
I am having difficulty reproducing the experimental results on the NQ data.
I will try to be as brief as possible.
Model used:
M1) GenRead-3B-NQ,
Contexts used:
C1) supervised:clustering (Recall@10: 71.3)
C2) DPR (FiD-distil), i.e. the retrieved contexts provided by the FiD authors (https://github.com/facebookresearch/FiD) (Recall@10: 80.3)
Note that I fixed the number of used documents to 10 (via the --n_context argument).
Since we have 1 model and 2 contexts, there are 2 possible combinations. i.e. M1+C1, M1+C2.
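For reference, here is a minimal sketch of how Recall@10 can be computed on an eval file in the FiD JSON format (the file path is a placeholder, and the substring check is naive; the standard evaluation normalizes answers first):

import json

def has_answer(text, answers):
    # Naive substring check; the standard evaluation normalizes answers first
    text = text.lower()
    return any(a.lower() in text for a in answers)

with open("nq_test.json") as f:  # placeholder path
    data = json.load(f)

# An example counts as a hit if any of its top-10 contexts contains an answer
hits = sum(
    any(has_answer(ctx["text"], ex["answers"]) for ctx in ex["ctxs"][:10])
    for ex in data
)
print(f"Recall@10: {100 * hits / len(data):.1f}")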
I cloned the FiD repo and ran the command below.
python test_reader.py --model_path {model_path} --eval_data {test_json_path} --per_gpu_batch_size 1 --n_context 10
M1+C1 is reported as 45.6 in Table 2, but my experiment gives 46.2. This seems like a reasonable margin of error.
However, M1+C2 (comparable to row 5 in Table 2) comes out to 41.3, which is very different from the reported 50.1.
In summary, FiD-xl produces the same result as the paper when used with generated documents, but the result is very different when used with retrieved documents.
Do you have any suggestions on what I'm doing wrong?
Best regards,
Eunseong
In addition, when I tested the FiD-large model provided by the FiD authors with the contexts C2, I got an EM of 50.66.
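For clarity on the metric: by EM I mean the standard SQuAD-style exact match (lowercase, strip punctuation and articles); a minimal sketch, with function names of my own:

import re, string

def normalize(s):
    # Lowercase, drop punctuation, remove articles, collapse whitespace
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, answers):
    # Scores 1.0 if the normalized prediction matches any gold answer
    return float(any(normalize(prediction) == normalize(a) for a in answers))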
GenRead-3B-NQ != FiD-3B-NQ. Since generated and retrieved documents follow different distributions, using GenRead-3B-NQ on DPR-retrieved documents results in a performance drop; vice versa, using FiD-3B-NQ on GPT-generated documents would also drop performance.
So M1 + C2 is a transfer learning setting, and it leads to lower performance than using the DPR-trained retriever with the DPR-trained FiD reader (the 50.1 EM score).
FiD-3B-NQ model: https://huggingface.co/wyu1/FiD-3B-NQ
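For example, evaluating FiD-3B-NQ on the DPR-retrieved contexts with the same command should land close to the reported number (paths are placeholders):

python test_reader.py --model_path {fid_3b_nq_path} --eval_data {dpr_test_json_path} --per_gpu_batch_size 1 --n_context 10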
Thanks for the quick response! :)
I'll check it out with the new checkpoint.