/lucene-clueweb-retrieval

Reproducible IR experiments with Apache Lucene

Primary language: Java. License: Apache License 2.0 (Apache-2.0).

Introduction

This project analyzes the frequency distributions of query terms on the ClueWeb09B collection. The ClueWeb09B dataset consists of about 50 million English pages collected in January and February 2009. The dataset was used by four Web Tracks (2009, 2010, 2011, and 2012) of the TREC conference.
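To make the notion of a frequency distribution concrete, here is a minimal sketch under stated assumptions: documents are plain strings, tokenization is a whitespace split, and no stemming is applied. The class and method names (TermFrequencyHistogram, termFrequency, histogram) are hypothetical and are not taken from this project, which works against a Lucene index and may normalize frequencies differently.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a toy stand-in for a frequency distribution of one query term.
class TermFrequencyHistogram {

    /** Counts occurrences of a lower-case term in one whitespace-tokenized document. */
    static int termFrequency(String term, String document) {
        int count = 0;
        for (String token : document.toLowerCase().split("\\s+")) {
            if (token.equals(term)) {
                count++;
            }
        }
        return count;
    }

    /** Maps each within-document frequency to the number of documents that have it. */
    static Map<Integer, Integer> histogram(String term, List<String> documents) {
        Map<Integer, Integer> result = new TreeMap<>();
        for (String doc : documents) {
            result.merge(termFrequency(term, doc), 1, Integer::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical three-document collection.
        List<String> docs = Arrays.asList(
                "lucene indexes text",
                "lucene scores lucene hits",
                "solr wraps search");
        System.out.println(histogram("lucene", docs)); // prints {0=1, 1=1, 2=1}
    }
}
```

In this toy collection, one document does not contain the term, one contains it once, and one contains it twice.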

This project uses a total of 200 queries (called topics in TREC jargon) from the TREC Web Tracks that ran from 2009 to 2012. The queries were created over four years, with TREC publishing 50 new queries (and their relevance judgments) each year.

Apache Lucene/Solr is used as the retrieval platform. Stock Lucene/Solr ships with many ranking model implementations, including BM25, Language Models, Divergence from Randomness Models, and Information Based Models. As explained in the write-up, the Flexible Ranking feature was added to Lucene during Google Summer of Code 2011.

Tools

This project is a flexible framework for conducting retrieval experiments on the ClueWeb09-English corpus. Different term-weighting models provided by Lucene/Solr are compared on the 200 Web Track information needs.

Configuration parameters are fed to the framework as a properties file. It has two main input parameters, including the location of the input documents; the other directories are created by the framework itself. Please see Standard Directory Layout.
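As an illustration of how a properties file can supply such a parameter, the sketch below reads one required value with java.util.Properties. The key name docs.path, the sample path, and the class ConfigSketch are assumptions made for this example; the actual keys are the ones defined in the project's config.properties.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;

// Illustrative only: not the framework's own configuration loader.
class ConfigSketch {

    // Hypothetical key name for the location of the input documents.
    static final String DOCS_KEY = "docs.path";

    /** Parses properties-format text and returns the trimmed value of a required key. */
    static String requiredValue(Reader source, String key) throws IOException {
        Properties props = new Properties();
        props.load(source);
        String value = props.getProperty(key);
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("Missing required property: " + key);
        }
        return value.trim();
    }

    public static void main(String[] args) throws IOException {
        // Inline text stands in for a config.properties file on disk.
        String text = "# location of the input documents (hypothetical key and path)\n"
                + "docs.path=/data/ClueWeb09_English_1\n";
        System.out.println(requiredValue(new StringReader(text), DOCS_KEY));
    }
}
```

To read a real file instead of inline text, a java.io.FileReader opened on the properties file can be passed as the source. Failing fast on a missing key keeps a misconfigured run from starting with an undefined input location.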

The framework is distributed as a tar.gz file, which can be generated with the mvn clean package dependency:copy-dependencies assembly:single command. The tarball includes an executable script named run.sh and a config.properties file containing various parameters. When ./run.sh is invoked, simple usage information is displayed. Arguments are passed to it to run one of the tools.

Standard Directory Layout

The next section documents the directory layout expected and used by this project. In general, each folder contains two outermost folders: KStem and KStemAnchor. These represent KStemming and AnchorText respectively. In the folder naming convention, WT stands for Web Track, TT stands for Terabyte Track, etc.

Dependencies

  • Perl: yum install perl
  • Bzip2: yum install bzip2
  • The Million Query evaluation tool statAP_MQ_eval_v4.pl requires the LWP::Simple module: yum install perl-CPAN and perl -MCPAN -e'install "LWP::Simple"'
  • Check where the LWP::Simple module is installed on your system, then add a use lib line with that path just above the use LWP::Simple statement in the statAP_MQ_eval_v4.pl file, for example:
use lib '/home/iorixxx/perl5/lib/perl5';
use LWP::Simple;
  • JDK 1.8 or above
  • Apache Maven 3.0.3 or above
  • Apache Lucene (Solr) 6.5.0

Author

Please feel free to contact Ahmet Arslan at iorixxx@yahoo.com if you have any questions, comments, or contributions.

Citation Policy

If you use this library for research purposes, please use the following citation:

@article{
  author = "Arslan, Ahmet and Din{\c{c}}er, Bekir Taner",
  title = "A selective approach to index term weighting for robust information retrieval based on the frequency distributions of query terms",
  journal = "Information Retrieval Journal",
  year = "2018",
  doi = "10.1007/s10791-018-9347-9",
  url = "https://link.springer.com/article/10.1007/s10791-018-9347-9"
}