ProxLogPRF: A Java repository from JeremyLeiLiu

ProxLogPRF: A Proximity-based Log-logistic Feedback Model for Pseudo-relevance Feedback

This is the official repository of the manuscript "ProxLogPRF: A Proximity-based Log-logistic Feedback Model for Pseudo-relevance Feedback" submitted to Information Processing & Management (IP&M).

Updates

Jan 23, 2022: the project has been released to GitHub.

Requirements

Java 1.7 -- development environment
the required *.jar packages -- included in the 'lib' folder
pandas -- used by trecEval.py

Experimentation instructions

Step 1: create the index for each data collection (e.g. WT2G).

```
$ javac -cp "lib/*:" ./*.java                                             # compile
$ java -cp "lib/*:" index.IndexTREC -docs datasets/WT2G/ -data WT2G       # create index for WT2G
```
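To make the role of the index concrete, here is a minimal Python sketch of a positional inverted index, the kind of structure a proximity-based model needs in order to look up where terms occur. It is not the repository's `index.IndexTREC` implementation. The function name `build_positional_index`, the whitespace tokenizer and the toy documents are assumptions made for illustration only.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [positions]} for a dict of doc_id -> text.

    Hypothetical helper: tokenization is a plain lowercase whitespace split,
    with no stopword removal or stemming.
    """
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


# Toy collection (made up for this example).
docs = {
    "d1": "pseudo relevance feedback improves retrieval",
    "d2": "relevance feedback with proximity",
}
index = build_positional_index(docs)
print(index["feedback"])
```

Storing positions rather than only term counts is what allows distances between query terms and candidate feedback terms to be measured later.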

Step 2: retrieve documents with each model (e.g. BM25). The ranking results are written to a file named *-report.txt under 'ProxLogPRF/result'. A *.txt file containing the MAP results is also generated under 'ProxLogPRF/result'; it is used to find the best model.
```
$ javac -cp "lib/*:" ./*.java                                             # compile
$ java -cp "lib/*:" models.BM25 -k1 1.2 -b 0.35                          # use BM25 model to retrieve documents
$ java -cp "lib/*:" models.BM25 -h                                       # print the argument usage
```
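For readers unfamiliar with the two BM25 parameters passed above, the following Python sketch shows a textbook form of the BM25 scoring function. It is an illustration under stated assumptions, not the code in `models.BM25`: the function name `bm25_score`, its arguments and the particular IDF smoothing are chosen here for the example, and the repository's variant may differ.

```python
import math


def bm25_score(query_terms, doc_terms, doc_freqs, num_docs, avg_doc_len,
               k1=1.2, b=0.35):
    """Textbook BM25 score of one document for a bag-of-words query.

    doc_freqs maps a term to the number of documents containing it.
    The defaults mirror the -k1 and -b values in the command above.
    """
    doc_len = len(doc_terms)
    score = 0.0
    for term in set(query_terms):
        tf = doc_terms.count(term)
        df = doc_freqs.get(term, 0)
        if tf == 0 or df == 0:
            continue
        # Smoothed IDF that stays non-negative (one common variant, assumed here).
        idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1.0)
        # b controls document-length normalization; k1 controls tf saturation.
        norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
        score += idf * tf * (k1 + 1.0) / (tf + norm)
    return score


# Toy usage with made-up statistics.
print(bm25_score(["feedback"], ["relevance", "feedback"],
                 {"feedback": 1}, num_docs=2, avg_doc_len=2.0))
```

A larger `k1` lets repeated occurrences of a term keep adding to the score for longer, while `b` closer to 1 penalizes long documents more strongly.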
Step 3: evaluate the retrieval model - we use trecEval.py to evaluate the model performance in terms of MAP, P@20, nDCG and nDCG@20.
```
$ python trecEval.py result/BM25/WT2G-BM25-1.2-0.35-report.txt query-judge/qrels.WT2G result/WT2G-BM25-1.2-0.35.xls
```
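As a reminder of what the reported metrics measure, the sketch below computes average precision, P@k and nDCG@k for a single query in plain Python. It is not taken from trecEval.py, whose implementation may differ in details such as the gain function. The function names and the toy ranking are assumptions for illustration. MAP is the mean of the per-query average precision over all queries.

```python
import math


def average_precision(ranked_doc_ids, relevant_ids):
    """Average of the precision values at the rank of each relevant document."""
    if not relevant_ids:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            total += hits / rank
    return total / len(relevant_ids)


def precision_at_k(ranked_doc_ids, relevant_ids, k=20):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for d in ranked_doc_ids[:k] if d in relevant_ids) / k


def ndcg_at_k(ranked_doc_ids, gains, k=20):
    """nDCG@k with linear gains and a log2 rank discount (one common form).

    gains maps doc_id -> graded relevance; missing documents count as 0.
    """
    dcg = sum(gains.get(d, 0) / math.log2(rank + 1)
              for rank, d in enumerate(ranked_doc_ids[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy ranking and judgments (made up for this example).
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d5"}
print(average_precision(ranked, relevant))
print(precision_at_k(ranked, relevant, k=4))
print(ndcg_at_k(ranked, {d: 1 for d in relevant}, k=4))
```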

Package structure

analyzer
- MyStopAndStemmingAnalyzer.java: stopwords removal and stemming.
common
- ByWeightComparator.java: numerical comparator.
- MyQQParser.java: simplistic quality query parser.
- MyTrecParser.java: TREC document parser.
- QualityStats.java: computes the results (MAP, P@k and MRR) of a quality benchmark run for a single query or for a set of queries.
- StaTools.java: implementations of some basic statistical functions.
datasets: directory containing the data collections
index
- IndexTREC.java: creates the index for each data collection.
indices: directory containing the index files of each data collection
lib: directory containing all the *.jar packages used by the project
models: directory containing all the retrieval models (i.e. BM25, DLM, LL, LLPRF, LLEXPStar (LL+EXP*), PRoc2, PRoc3 and ProxLogPRF)
query-judge: directory containing the query topics and relevance judgments (e.g. qrels.WT2G)
result: directory containing the experimental results
stopwords.txt: stopwords used in our experiments
trecEval.py: evaluation script (MAP, P@20, nDCG and nDCG@20)
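
As a rough illustration of what a TREC document parser such as the `MyTrecParser.java` entry above has to do, the Python sketch below extracts the document number and the tag-free text from SGML-style `<DOC>` records. It does not reproduce the repository's parser: the function name `parse_trec_documents`, the regular expressions and the sample record are assumptions, and real collections contain further fields and irregularities that this sketch ignores.

```python
import re

DOC_PATTERN = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_PATTERN = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL)
TAG_PATTERN = re.compile(r"<[^>]+>")


def parse_trec_documents(raw):
    """Yield (docno, text) pairs from a string of concatenated <DOC> records."""
    for match in DOC_PATTERN.finditer(raw):
        body = match.group(1)
        docno_match = DOCNO_PATTERN.search(body)
        if docno_match is None:
            continue  # skip records without an identifier
        docno = docno_match.group(1)
        # Drop the DOCNO element, strip the remaining tags, collapse whitespace.
        remainder = body[:docno_match.start()] + body[docno_match.end():]
        text = " ".join(TAG_PATTERN.sub(" ", remainder).split())
        yield docno, text


# Made-up sample record in the usual TREC layout.
sample = """<DOC>
<DOCNO> SAMPLE-0001 </DOCNO>
<TEXT>
Proximity-based feedback.
</TEXT>
</DOC>"""
parsed = list(parse_trec_documents(sample))
print(parsed)
```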

Data collections

We tested the baselines, state-of-the-art (SOTA) proximity-based PRF models and our model variants on eight standard TREC collections, namely AP (Associated Press 1988-90), DISK1&2, DISK4&5, ROBUST04 (TREC Robust Track 2004), WSJ (Wall Street Journal), WT2G (TREC Web Track 2000), WT10G (TREC Web Track 2001-2002) and GOV2. Note that AP, DISK1&2, DISK4&5, ROBUST04 and WSJ are popular newswire collections where noise is rare, while WT2G, WT10G and GOV2 consist of web documents with inherent noise.

Acknowledgments

This research has been supported by the Natural Sciences and Engineering Research Council (NSERC) of Canada, the York Research Chairs (YRC) program, the NSERC CREATE award and an Ontario Research Fund Research Excellence (ORF-RE) award in BRAIN Alliance.