facebookresearch/DPR

Unable to reproduce table 3 (Gold-127N) from scratch

xhluca opened this issue · 1 comment

I followed the exact approach described in the paper, and I also reused parts of the code in this repository. After training the model with the specified hyperparameters, I obtained the following for Gold with 127 negative samples (i.e., a batch size of 128):

==================================================
Accuracies on split=valid
--------------------------------------------------
Top-1: 0.2521
Top-5: 0.5184
Top-20: 0.7033
Top-100: 0.8161
==================================================
Accuracies on split=test
--------------------------------------------------
Top-1: 0.2551
Top-5: 0.5341
Top-20: 0.7119
Top-100: 0.8305

I did not use BM25-retrieved hard negatives. The top-20 and top-100 accuracies are lower than reported. Is that expected?
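For context, "Gold with 127 negatives" here just means in-batch negatives: each question's gold passage acts as a negative for the other 127 questions in the batch. Below is a minimal sketch of the loss I trained with; the tensor and function names are illustrative, not this repository's API or the exact code from my gist.

```python
import torch
import torch.nn.functional as F

def in_batch_nll_loss(q_vectors: torch.Tensor, ctx_vectors: torch.Tensor) -> torch.Tensor:
    """q_vectors: [B, d] question embeddings; ctx_vectors: [B, d] gold passage embeddings."""
    scores = torch.matmul(q_vectors, ctx_vectors.t())              # [B, B] dot-product similarities
    targets = torch.arange(scores.size(0), device=scores.device)   # each question's positive is on the diagonal
    return F.nll_loss(F.log_softmax(scores, dim=1), targets)       # the other B-1 passages serve as negatives
```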

The validation results from the paper:
[screenshot of the paper's retrieval accuracy table]

The training code I used can be found here: https://gist.github.com/xhluca/28181468e3907145027969a1003ae929

EDIT: I realized I was doing naive string matching and using the wrong file (I should have used nq.qas instead of the biencoder file), so I switched to the evaluation code provided by this repository. As a sanity check, I built an index using the Hugging Face DPRContextEncoder, and the results were off by only ~0.5% from the reported G+BM25 (127+128) numbers, which seems quite reasonable (and likely indicates there is no issue with how I copy-pasted the qa_validation.py code).
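For anyone hitting the same problem, the gap between my naive check and the repo-style evaluation comes down to how an answer is matched inside a retrieved passage. Here is a rough sketch of the two schemes; this is my own simplification, not the actual qa_validation.py code:

```python
import re

def naive_match(answers, passage):
    # What I originally did: raw, case-sensitive substring check.
    return any(ans in passage for ans in answers)

def _tokens(text):
    # Lowercase and split into word tokens; a crude stand-in for the
    # tokenizer-based normalization the repo's evaluation performs.
    return re.findall(r"\w+", text.lower())

def token_match(answers, passage):
    # Closer in spirit to the repo's evaluation: the normalized answer token
    # sequence must appear contiguously inside the normalized passage tokens.
    p_toks = _tokens(passage)
    for ans in answers:
        a_toks = _tokens(ans)
        if a_toks and any(p_toks[i:i + len(a_toks)] == a_toks
                          for i in range(len(p_toks) - len(a_toks) + 1)):
            return True
    return False
```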

So, I updated the results above, and I now notice that the dev-set results are off by ~1.5% for top-100, ~3% for top-20, and ~4% for top-5. I doubt the issue is with the evaluation or the build_index script (both were used in the sanity check above), but nothing looks wrong with the training code linked above either.
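For completeness, this is roughly how the sanity check was wired up: encode passages and questions with the pretrained Hugging Face DPR checkpoints and search a flat inner-product FAISS index. The passage and question lists below are placeholders, and the checkpoint names are the standard HF DPR releases rather than anything from my own training run.

```python
import faiss
import numpy as np
import torch
from transformers import (
    DPRContextEncoder, DPRContextEncoderTokenizer,
    DPRQuestionEncoder, DPRQuestionEncoderTokenizer,
)

# Standard pretrained HF DPR checkpoints (not my own trained model).
ctx_tok = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")
ctx_enc = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base").eval()
q_tok = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-single-nq-base")
q_enc = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-single-nq-base").eval()

passages = ["Aristotle was a Greek philosopher and polymath ...", "..."]  # placeholder corpus
questions = ["Who was Aristotle?"]                                        # placeholder queries

with torch.no_grad():
    p_emb = ctx_enc(**ctx_tok(passages, padding=True, truncation=True,
                              return_tensors="pt")).pooler_output.numpy()
    q_emb = q_enc(**q_tok(questions, padding=True, truncation=True,
                          return_tensors="pt")).pooler_output.numpy()

index = faiss.IndexFlatIP(p_emb.shape[1])                 # exact inner-product search
index.add(np.ascontiguousarray(p_emb))
scores, ids = index.search(np.ascontiguousarray(q_emb), k=min(20, len(passages)))
print(ids[0])                                             # top passage ids for the first question
```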

I moved to hard negatives, so this is no longer an issue for me.