ProxLogPRF: A Proximity-based Log-logistic Feedback Model for Pseudo-relevance Feedback

This is the official repository of the manuscript "ProxLogPRF: A Proximity-based Log-logistic Feedback Model for Pseudo-relevance Feedback" submitted to Information Processing & Management (IP&M).

Updates

  • Jan 23, 2022: the project was released on GitHub.

Requirements

  • Java 1.7 -- development environment
  • the required *.jar packages -- included in the 'lib' folder
  • pandas -- used by trecEval.py

Experimentation instructions

  • Step 1: create the index for each data collection (e.g. WT2G).

    $ javac -cp "lib/*:" ./*.java                                             # compile
    $ java -cp "lib/*:" index.IndexTREC -docs datasets/WT2G/ -data WT2G       # create index for WT2G
    
  • Step 2: retrieve documents with each model (e.g. BM25). The ranking results are written to a file named *-report.txt under 'ProxLogPRF/result'. A *.txt file containing the MAP results is also generated under 'ProxLogPRF/result'; it is used to find the best model.

    $ javac -cp "lib/*:" ./*.java                                             # compile
    $ java -cp "lib/*:" models.BM25 -k1 1.2 -b 0.35                          # use BM25 model to retrieve documents
    $ java -cp "lib/*:" models.BM25 -h                                       # show argument usage
    
  • Step 3: evaluate the retrieval model. We use trecEval.py to measure performance in terms of MAP, P@20, nDCG and nDCG@20.

    $ python trecEval.py result/BM25/WT2G-BM25-1.2-0.35-report.txt query-judge/qrels.WT2G result/WT2G-BM25-1.2-0.35.xls
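
For orientation, the `-k1` and `-b` options passed to `models.BM25` in Step 2 are the usual BM25 term-frequency saturation and length-normalization parameters. The following is a minimal sketch of one common BM25 variant, written in Python for brevity; it is not the repository's Java code, and the function name `bm25_score`, the IDF form and the toy corpus are assumptions made for illustration only.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len,
               k1=1.2, b=0.35):
    """Score one tokenized document against a query (illustrative BM25 variant)."""
    term_counts = Counter(doc_terms)
    # Length normalization: b=0 ignores document length, b=1 normalizes fully.
    length_norm = k1 * (1.0 - b + b * len(doc_terms) / avg_doc_len)
    score = 0.0
    for term in query_terms:
        tf = term_counts.get(term, 0)
        if tf == 0:
            continue  # terms absent from the document contribute nothing
        df = doc_freqs.get(term, 0)
        # Assumed IDF form (always non-negative); other variants exist.
        idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
        # k1 controls how quickly repeated occurrences saturate.
        score += idf * tf * (k1 + 1.0) / (tf + length_norm)
    return score


# Toy corpus (hypothetical data, for illustration only).
docs = {
    "d1": ["proximity", "feedback", "model"],
    "d2": ["feedback", "feedback", "log", "logistic", "model", "ranking"],
    "d3": ["web", "track", "collection"],
}
doc_freqs = Counter(term for terms in docs.values() for term in set(terms))
avg_doc_len = sum(len(terms) for terms in docs.values()) / len(docs)
query = ["proximity", "feedback"]

ranking = sorted(
    docs,
    key=lambda d: bm25_score(query, docs[d], doc_freqs, len(docs), avg_doc_len),
    reverse=True,
)
print(ranking)
```

The defaults mirror the values used in the Step 2 command (`-k1 1.2 -b 0.35`); the exact scoring formula used by `models.BM25` may differ from this sketch.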
    

Package structure

  • analyzer
    • MyStopAndStemmingAnalyzer.java: stopword removal and stemming.
  • common
    • ByWeightComparator.java: numerical comparator.
    • MyQQParser.java: simplistic quality query parser.
    • MyTrecParser.java: TREC document parser.
    • QualityStats.java: computes the results (MAP, P@k and MRR) of a quality benchmark run for a single query or for a set of queries.
    • StaTools.java: implementations of some basic statistical functions.
  • datasets: directory containing the data collections
  • index
    • IndexTREC.java: creates the index for each data collection.
  • indices: directory containing the index files of each data collection
  • lib: directory containing all the *.jar packages used by the project
  • models: directory containing all the retrieval models (i.e. BM25, DLM, LL, LLPRF, LLEXPStar (LL+EXP*), PRoc2, PRoc3 and ProxLogPRF)
  • query-judge: directory containing the query topics and relevance judgments (e.g. qrels.WT2G)
  • result: directory containing the experimental results
  • stopwords.txt: stopwords used in our experiments
  • trecEval.py: computes the evaluation metrics
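
As a rough illustration of the metrics reported by the evaluation step (MAP, P@k and nDCG), the sketch below computes them for a ranked list of document IDs. It is not the code in trecEval.py: the function names are hypothetical, AP and P@k assume binary relevance, and nDCG assumes linear gains with a log2 rank discount, which may differ from the script's conventions.

```python
import math


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are relevant (divides by k)."""
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids) / k


def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values at the rank of each relevant document."""
    if not relevant_ids:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved count as zero precision.
    return total / len(relevant_ids)


def mean_average_precision(run, qrels):
    """MAP over all judged queries; run and qrels are dicts keyed by query ID."""
    if not qrels:
        return 0.0
    return sum(average_precision(run.get(q, []), rel)
               for q, rel in qrels.items()) / len(qrels)


def ndcg(ranked_ids, gains, k=None):
    """nDCG (optionally cut off at rank k); gains maps doc ID to graded relevance."""
    ranked = ranked_ids if k is None else ranked_ids[:k]
    ideal = sorted((g for g in gains.values() if g > 0), reverse=True)
    if k is not None:
        ideal = ideal[:k]
    dcg = sum(max(gains.get(doc_id, 0), 0) / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked, start=1))
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical run and judgments for a single query.
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
print(precision_at_k(ranked, relevant, 2))
print(average_precision(ranked, relevant))
print(ndcg(ranked, {"d1": 1, "d3": 1}, k=20))
```

Passing `k=20` to `precision_at_k` and `ndcg` corresponds to the P@20 and nDCG@20 cutoffs named in the experimentation instructions.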

Data collections

We tested the baselines, state-of-the-art proximity-based PRF models and our model variants on eight standard TREC collections, namely AP (Associated Press 1988-90), DISK1&2, DISK4&5, ROBUST04 (TREC Robust Track 2004), WSJ (Wall Street Journal), WT2G (TREC Web Track 2000), WT10G (TREC Web Track 2001-2002) and GOV2. Note that AP, DISK1&2, DISK4&5, ROBUST04 and WSJ are popular newswire collections in which noise is rare, while WT2G, WT10G and GOV2 consist of web documents that are inherently noisy.

Acknowledgments

This research has been supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada, the York Research Chairs (YRC) program, the NSERC CREATE award and an ORF-RE (Ontario Research Fund Research Excellence) award in BRAIN Alliance.