Cannot reproduce results in the paper
I followed your instructions on the openbookqa task and got the following results:
full cache / dense:
"openbookqa": { "acc": 0.414, "acc_stderr": 0.02204949796982787, "acc_norm": 0.458, "acc_norm_stderr": 0.022303966774269938 }
streamingllm:
"openbookqa": { "acc": 0.256, "acc_stderr": 0.019536923574747588, "acc_norm": 0.342, "acc_norm_stderr": 0.02123614719989926 }
h2o:
"openbookqa": { "acc": 0.264, "acc_stderr": 0.01973288558592208, "acc_norm": 0.348, "acc_norm_stderr": 0.0213237286328075 }
cam:
"openbookqa": { "acc": 0.31, "acc_stderr": 0.020704041021724795, "acc_norm": 0.352, "acc_norm_stderr": 0.021380042385946055 }
I don't think this is a problem with my experiment environment. I ran the official H2O repo and got almost the same 5-shot evaluation scores as reported in their paper.
What ratio did you set? Each question in the openbookqa dataset gives the model 4 options to choose from, so random guessing already yields a baseline acc of 25%, regardless of the cache.
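To make the comparison with the 25% chance level concrete, here is a minimal sketch that expresses each reported `acc` as a number of standard errors above chance. The numbers are copied from the results posted in this thread; the helper name `z_above_chance` is just an illustration and not part of any repo.

```python
# Scores reported earlier in this thread (0-shot openbookqa): (acc, acc_stderr).
results = {
    "full cache / dense": (0.414, 0.02204949796982787),
    "streamingllm": (0.256, 0.019536923574747588),
    "h2o": (0.264, 0.01973288558592208),
    "cam": (0.31, 0.020704041021724795),
}

NUM_OPTIONS = 4              # openbookqa questions have 4 answer options
CHANCE = 1.0 / NUM_OPTIONS   # accuracy of uniform random guessing


def z_above_chance(acc, stderr, chance=CHANCE):
    """Distance of `acc` from the chance level, in units of its standard error."""
    return (acc - chance) / stderr


for name, (acc, stderr) in results.items():
    z = z_above_chance(acc, stderr)
    print(f"{name}: acc={acc:.3f}, {z:.2f} standard errors above chance")
```

With these numbers, streamingllm and h2o land within one standard error of chance, while cam is a little under three standard errors above it and the dense run is far above.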
Both the start-ratio and the recent-ratio are set to 0.1, and the evaluation is in the 0-shot setting.
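For a rough sense of what these ratios mean for a short 0-shot prompt, here is a small sketch. It assumes each budget is the ratio times the prompt length, truncated to an integer; that is an assumption for illustration, and the actual code may round or clamp differently. The function `kept_tokens` and the example length of 64 tokens are hypothetical.

```python
def kept_tokens(seq_len, start_ratio, recent_ratio):
    """Hypothetical budget: tokens kept from the start and the end of the prompt.

    Assumes budget = int(ratio * seq_len); the real implementation may differ.
    """
    start = int(start_ratio * seq_len)
    recent = int(recent_ratio * seq_len)
    # The two windows cannot keep more tokens than the prompt contains.
    total = min(seq_len, start + recent)
    return start, recent, total


# Example: a 64-token prompt (made-up length) with both ratios at 0.1.
start, recent, total = kept_tokens(64, 0.1, 0.1)
print(f"start={start}, recent={recent}, kept={total} of 64 tokens")
```

Under this assumption only about a fifth of the prompt stays in the cache, which would be a very small budget for short 0-shot inputs.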