taolei87/rcnn

Disparity in BM25 performance

KrishnenduGhosh opened this issue · 3 comments

Hi Tao Lei,

Recently I was trying to build a Lucene-based BM25 baseline using the AskUbuntu dataset you provided. For indexing (IndexWriter) I used title+body from all 167765 questions, and for testing I searched with title+body for all 189 queries (11 queries have no similar questions). I set the IndexSearcher similarity to BM25Similarity in Apache Lucene 6.1.0. All other Lucene settings are left at their defaults, apart from the analyzer (EnglishAnalyzer).

The problem is that I am getting a MAP of around 0.11, which is nowhere near the performance you reported for BM25. I suspect I am missing some steps somewhere. Could you please help me with this issue?
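For reference, this is roughly how MAP can be computed over the ranked results. It is a minimal sketch, not the evaluation code from this repository: the class `MapSketch`, its method names, and the `skipQueriesWithoutRelevant` flag are all hypothetical, and it assumes average precision is normalised by the total number of relevant questions for a query. The flag is there because queries with no similar questions (11 in my case) can either be skipped or counted as zero, and the two choices give different MAP values.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of taolei87/rcnn or Lucene.
final class MapSketch {

    // Average precision for one query: sum of precision@k at every rank k
    // holding a relevant document, divided by the number of relevant documents.
    // Assumption: callers handle the "no relevant documents" case themselves.
    static double averagePrecision(List<String> ranked, Set<String> relevant) {
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
                sum += (double) hits / (i + 1);
            }
        }
        return relevant.isEmpty() ? 0.0 : sum / relevant.size();
    }

    // Mean of the per-query average precisions. Queries without any relevant
    // document are either skipped or counted as 0.0, depending on the flag.
    static double meanAveragePrecision(List<List<String>> rankings,
                                       List<Set<String>> relevants,
                                       boolean skipQueriesWithoutRelevant) {
        double total = 0.0;
        int counted = 0;
        for (int q = 0; q < rankings.size(); q++) {
            Set<String> relevant = relevants.get(q);
            if (relevant.isEmpty() && skipQueriesWithoutRelevant) {
                continue; // query does not contribute to the mean
            }
            total += averagePrecision(rankings.get(q), relevant);
            counted++;
        }
        return counted == 0 ? 0.0 : total / counted;
    }

    public static void main(String[] args) {
        // Toy data: three queries, the last one has no similar questions.
        List<List<String>> rankings = Arrays.asList(
                Arrays.asList("a", "b", "c", "d"),
                Arrays.asList("x", "y"),
                Arrays.asList("p", "q"));
        List<Set<String>> relevants = Arrays.asList(
                new HashSet<>(Arrays.asList("a", "c")),
                new HashSet<>(Arrays.asList("y")),
                Collections.<String>emptySet());
        System.out.println("MAP (skipping empty queries): "
                + meanAveragePrecision(rankings, relevants, true));
        System.out.println("MAP (empty queries count as 0): "
                + meanAveragePrecision(rankings, relevants, false));
    }
}
```

On the toy data the two settings give different means, so it may be worth confirming which convention the reported BM25 numbers use.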

Hi @KrishnenduGhosh

Rish (the second author) worked on the BM25 baseline for this project. I remember that he spent quite a bit of time tuning the BM25 baseline and the preprocessing.

Could you email Rish (hrishjoshi2@gmail.com and hjoshi@mit.edu) for more information about the Lucene set-up? Sorry about the inconvenience.

@KrishnenduGhosh I could also email him on your behalf. Just let me know your email address.