spud: A Java repository from ronancummins

This set of classes is a Lucene implementation of the SPUD retrieval
model described in "A Polya Urn Document Language Model for
Improved Information Retrieval" by Ronan Cummins, Jiaul Hoque Paik,
and Yuanhua Lv.

The classes depend on the following publicly available jar files:

lucene-core-5.0.0.jar
lucene-queryparser-5.0.0.jar
lucene-analyzers-common-5.0.0.jar
lucene-queries-5.0.0.jar
commons-math3-3.3.jar
jsoup-1.7.3.jar

To build the classes, create a "classes" directory at the same level as "src".

>mkdir classes

Then run

>make all

This download includes the Cranfield collection (converted to the TREC format).
The three important files for the modified Cranfield collection are:

cran.all.1400.trec-format (the documents)
cran.qry.trec-format (the queries)
cran.qrels.trec-format (the qrels)
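
As a rough illustration of how a qrels file can be read, here is a minimal sketch. It assumes the standard TREC qrels layout of four whitespace-separated fields per line (query id, iteration, document id, relevance); whether cran.qrels.trec-format follows this layout exactly is an assumption. The class QrelsSketch is hypothetical and is not part of this repository.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of the repository.
class QrelsSketch {

    // Parses lines assumed to look like "<query-id> <iteration> <doc-id> <relevance>"
    // and returns, for each query id, the set of doc ids with relevance > 0.
    static Map<String, Set<String>> parse(Iterable<String> lines) {
        Map<String, Set<String>> relevant = new HashMap<>();
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length < 4) {
                continue; // skip blank or malformed lines
            }
            int relevance;
            try {
                relevance = Integer.parseInt(fields[3]);
            } catch (NumberFormatException e) {
                continue; // skip lines whose relevance field is not an integer
            }
            if (relevance > 0) {
                relevant.computeIfAbsent(fields[0], k -> new HashSet<>()).add(fields[2]);
            }
        }
        return relevant;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: QrelsSketch <qrels-file>");
            return;
        }
        Map<String, Set<String>> relevant =
                parse(Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8));
        System.out.println("queries with at least one relevant document: " + relevant.size());
    }
}
```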

The only two classes with main methods are:
indexing.LuceneTRECIndexer
searching.QuerySearch

To index the Cranfield collection, create an index file that lists the full path of each file you wish to index, one path per line.
For the Cranfield collection the index file should contain only one line, e.g.
././cran.all.1400.trec-format
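
If you prefer to generate the index file rather than write it by hand, something like the following sketch could be used. IndexFileWriter is a hypothetical helper, not part of this repository; it assumes that the indexer accepts absolute paths, one per line.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the repository.
class IndexFileWriter {

    // Turns each input path into an absolute, normalized path string.
    static List<String> toAbsoluteLines(List<String> inputs) {
        List<String> lines = new ArrayList<>();
        for (String input : inputs) {
            lines.add(Paths.get(input).toAbsolutePath().normalize().toString());
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: IndexFileWriter <index-file> <collection-file>...");
            return;
        }
        Path indexFile = Paths.get(args[0]);
        List<String> inputs = Arrays.asList(args).subList(1, args.length);
        // Writes one absolute path per line to the index file.
        Files.write(indexFile, toAbsoluteLines(inputs), StandardCharsets.UTF_8);
    }
}
```

The first argument is the index file to write and the remaining arguments are the collection files to list in it.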

Then from the classes directory run:
>java -cp .:../lib/* indexing.LuceneTRECIndexer ../cranfield-collection/lucene_index ../cranfield-collection/index-file 1 0

This will create the index in the "lucene_index" directory.

You can then run the queries on the collection from the classes directory as follows:
>java -cp .:../lib/* searching.QuerySearch ../cranfield-collection/lucene_index ../cranfield-collection/cran.qry.trec-format ../cranfield-collection/cran.qrels.trec-format

This should run the basic SPUD model on the queries and calculate some effectiveness metrics for them.
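
This README does not say which effectiveness metrics are reported. As an illustration only, the sketch below computes average precision, a commonly used metric, for a single query. AveragePrecisionSketch and the document ids are hypothetical; this is not the code used by QuerySearch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration, not part of the repository.
class AveragePrecisionSketch {

    // Average precision for one query: the mean, over all relevant documents,
    // of the precision at the rank where each relevant document is retrieved.
    // Relevant documents that are never retrieved contribute zero.
    static double averagePrecision(List<String> ranking, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < ranking.size(); i++) {
            if (relevant.contains(ranking.get(i))) {
                hits++;
                sum += (double) hits / (i + 1); // precision at rank i + 1
            }
        }
        return sum / relevant.size();
    }

    public static void main(String[] args) {
        // Made-up ranking and relevance judgements for a single query.
        List<String> ranking = Arrays.asList("d3", "d1", "d7", "d2");
        Set<String> relevant = new HashSet<>(Arrays.asList("d1", "d2"));
        System.out.println(averagePrecision(ranking, relevant)); // prints 0.5
    }
}
```

Averaging this value over all queries gives mean average precision (MAP).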

Copyright © 2015 Ronan Cummins
This work is free. It comes without any warranty to the extent
permitted by applicable law. You can redistribute it and/or modify it
under the terms of the Do What The Fuck You Want To Public License, Version 2,
as published by Sam Hocevar. See http://www.wtfpl.net/ for more details.
