Unable to reproduce table 3 (Gold-127N) from scratch
xhluca opened this issue · 1 comments
I followed the exact approach described in the paper, and I also reused parts of the code in this repository. After training the model with the specified hyperparameters, I obtained the following results for Gold with 127 negative samples (i.e., a batch size of 128):
==================================================
Accuracies on split=valid
--------------------------------------------------
Top-1: 0.2521
Top-5: 0.5184
Top-20: 0.7033
Top-100: 0.8161
==================================================
Accuracies on split=test
--------------------------------------------------
Top-1: 0.2551
Top-5: 0.5341
Top-20: 0.7119
Top-100: 0.8305
I did not use BM25-retrieved hard negatives. The top-20 and top-100 accuracies are lower than reported. Is that normal?
For comparison, the validation results from the paper are in Table 3 (Gold-127N).
The training code I used can be found here: https://gist.github.com/xhluca/28181468e3907145027969a1003ae929
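For reference, the core of that training code is the standard in-batch negative objective; here is a minimal sketch of what I understand the Gold-127N setting to mean (simplified function and variable names, not the repository's actual loss implementation):

```python
import torch
import torch.nn.functional as F

def in_batch_negative_loss(q_vectors, ctx_vectors):
    """NLL loss over in-batch gold negatives: with a batch of 128
    question/passage pairs, each question treats its own gold passage as the
    positive and the other 127 gold passages as negatives.

    q_vectors:   (B, d) question embeddings
    ctx_vectors: (B, d) gold passage embeddings, row-aligned with q_vectors
    """
    # Dot-product similarity of every question against every passage in the batch.
    scores = torch.matmul(q_vectors, ctx_vectors.t())  # (B, B)
    # The positive for question i sits on the diagonal.
    targets = torch.arange(scores.size(0), device=scores.device)
    return F.cross_entropy(scores, targets)
```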
EDIT: I realized that I was using naive string matching and the wrong file (I should have used nq.qas instead of the biencoder file), so I switched to the evaluation code provided by this repository. As a sanity check, I built an index using the huggingface DPRContextEncoder, and the results were off by only ~0.5% from G+BM25 127+128, which seems quite reasonable (and likely indicates there is no issue with how I copy-pasted the qa_validation.py code).
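The sanity-check index was built roughly like this (a sketch: the checkpoint name is the released single-NQ one and is an assumption on my part, and `titles`/`texts` stand in for the passage lists):

```python
import faiss
import torch
from transformers import DPRContextEncoder, DPRContextEncoderTokenizer

# Assumed checkpoint: the single-NQ context encoder released on the HF hub.
ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base").eval()
ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-single-nq-base")

@torch.no_grad()
def encode_passages(titles, texts, batch_size=64):
    """Encode (title, text) passage pairs into dense vectors."""
    vectors = []
    for i in range(0, len(texts), batch_size):
        inputs = ctx_tokenizer(
            titles[i:i + batch_size],
            texts[i:i + batch_size],
            padding=True, truncation=True, max_length=256, return_tensors="pt",
        )
        vectors.append(ctx_encoder(**inputs).pooler_output)
    return torch.cat(vectors).numpy()

# Flat inner-product index, matching the dot-product scoring used in training.
passage_vectors = encode_passages(titles, texts)  # titles/texts: placeholder passage lists
index = faiss.IndexFlatIP(passage_vectors.shape[1])
index.add(passage_vectors)
```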
So, I updated the results above, and I now notice that the results on the dev set are off by 1.5% for top-100, 3% for top-20, and 4% for top-5. I doubt the issue is with the evaluation or with the build_index script (both were used in the earlier sanity check), but nothing looks wrong with the training code linked above either.
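For completeness, the metric being compared is just top-k retrieval accuracy over the nq.qas answer strings; here is a simplified sketch of it (substring matching only, whereas qa_validation.py does proper tokenizer-based matching, which is exactly why my earlier naive matching gave misleading numbers):

```python
import unicodedata

def _normalize(text):
    # Lowercase + unicode-normalize; a rough stand-in for the repo's matching.
    return unicodedata.normalize("NFD", text).lower()

def has_answer(answers, passage_text):
    """Naive string match: does any gold answer appear in the passage?"""
    passage = _normalize(passage_text)
    return any(_normalize(a) in passage for a in answers)

def top_k_accuracy(retrieved, qas, ks=(1, 5, 20, 100)):
    """retrieved[i]: ranked passage texts for question i;
    qas[i]: gold answer strings for question i (e.g., from nq.qas)."""
    hits = {k: 0 for k in ks}
    for passages, answers in zip(retrieved, qas):
        # Rank of the first passage containing a gold answer, if any.
        first_hit = next(
            (rank for rank, p in enumerate(passages) if has_answer(answers, p)),
            None,
        )
        for k in ks:
            if first_hit is not None and first_hit < k:
                hits[k] += 1
    return {k: hits[k] / len(qas) for k in ks}
```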
I moved to hard negatives, so this is no longer an issue for me.