/spud

A Lucene implementation of the SPUD language model for document retrieval

This set of classes is a Lucene implementation of the SPUD retrieval
model that appears in "A Polya Urn Document Language Model for
Improved Information Retrieval" by Ronan Cummins, Jiaul Hoque Paik,
and Yuanhua Lv.



The classes depend on the following publicly available jar files:

lucene-core-5.0.0.jar
lucene-queryparser-5.0.0.jar
lucene-analyzers-common-5.0.0.jar
lucene-queries-5.0.0.jar
commons-math3-3.3.jar
jsoup-1.7.3.jar



To build the classes, create a "classes" directory at the same level as "src". 

>mkdir classes

Then run

>make all

This download includes the Cranfield collection, converted to the TREC format.
The three important files of the converted collection are:

cran.all.1400.trec-format (the documents)
cran.qry.trec-format (the queries)
cran.qrels.trec-format (the qrels)
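
For readers unfamiliar with qrels files, here is a minimal Java sketch of how such relevance judgements could be loaded. It assumes the usual four-column TREC qrels layout (query id, iteration, document id, relevance), which this README does not confirm for cran.qrels.trec-format. The class QrelsSketch is hypothetical and is not one of the classes in this download.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the SPUD classes.
class QrelsSketch {

    // Parses lines of the ASSUMED form "<queryId> <iteration> <docId> <relevance>"
    // into a map: queryId -> (docId -> relevance).
    // Blank lines and lines with fewer than four fields are skipped.
    static Map<String, Map<String, Integer>> parse(List<String> lines) {
        Map<String, Map<String, Integer>> qrels = new HashMap<>();
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length < 4) {
                continue;
            }
            Map<String, Integer> judged = qrels.get(fields[0]);
            if (judged == null) {
                judged = new HashMap<>();
                qrels.put(fields[0], judged);
            }
            judged.put(fields[2], Integer.parseInt(fields[3]));
        }
        return qrels;
    }
}
```

The lines themselves could be read with java.nio.file.Files.readAllLines before being passed to parse.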


The only two classes with main methods are:
indexing.LuceneTRECIndexer
searching.QuerySearch


To index the Cranfield collection, create an index file that lists the full path of each file you wish to index.
For the Cranfield collection, the index file should contain only one line, e.g.
././cran.all.1400.trec-format
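
The index file can be written by hand, but it can also be generated. The following is a minimal Java sketch, assuming the indexer expects one path per line. The class IndexFileWriter and its methods are hypothetical and are not part of this download.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the SPUD classes.
class IndexFileWriter {

    // Converts each given path into an absolute, normalized path string.
    static List<String> toAbsoluteLines(List<String> files) {
        List<String> lines = new ArrayList<>();
        for (String file : files) {
            lines.add(Paths.get(file).toAbsolutePath().normalize().toString());
        }
        return lines;
    }

    // Writes the paths to the index file, one per line (assumed layout).
    static void write(Path indexFile, List<String> files) throws IOException {
        Files.write(indexFile, toAbsoluteLines(files), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: IndexFileWriter <index-file> <file-to-index>...");
            return;
        }
        List<String> files = new ArrayList<>();
        for (int i = 1; i < args.length; i++) {
            files.add(args[i]);
        }
        write(Paths.get(args[0]), files);
    }
}
```

For the Cranfield collection this would be called with the index file name followed by the single path to cran.all.1400.trec-format.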


Then from the classes directory run:
>java -cp .:../lib/* indexing.LuceneTRECIndexer ../cranfield-collection/lucene_index ../cranfield-collection/index-file 1 0 

This will create the index in the "lucene_index" directory.


You can then run the queries on the collection from the classes directory as follows:
>java -cp .:../lib/* searching.QuerySearch ../cranfield-collection/lucene_index ../cranfield-collection/cran.qry.trec-format ../cranfield-collection/cran.qrels.trec-format

This should run the basic SPUD model on the queries and also calculate some effectiveness metrics for them.
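
This README does not say which effectiveness metrics are reported. As an illustration of what such a metric looks like, the sketch below computes average precision for one query, a common retrieval measure. It is not taken from searching.QuerySearch. The class AveragePrecisionSketch and the document ids are hypothetical, and the ranking is assumed to contain no duplicate documents.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of the SPUD classes.
class AveragePrecisionSketch {

    // Average precision for one query: the sum of precision@k over every rank k
    // that holds a relevant document, divided by the total number of relevant
    // documents. Relevant documents that are never retrieved contribute zero.
    static double averagePrecision(List<String> ranking, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        double sum = 0.0;
        for (int i = 0; i < ranking.size(); i++) {
            if (relevant.contains(ranking.get(i))) {
                hits++;
                sum += (double) hits / (i + 1);
            }
        }
        return sum / relevant.size();
    }

    public static void main(String[] args) {
        // Made-up ranking and judgements: d1 is found at rank 2, d2 at rank 4,
        // d9 is never retrieved, so AP = (1/2 + 2/4) / 3 = 1/3.
        List<String> ranking = Arrays.asList("d3", "d1", "d7", "d2");
        Set<String> relevant = new HashSet<>(Arrays.asList("d1", "d2", "d9"));
        System.out.println(averagePrecision(ranking, relevant));
    }
}
```

Averaging this value over all queries gives mean average precision (MAP).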



Copyright © 2015 Ronan Cummins
This work is free. It comes without any warranty to the extent 
permitted by applicable law. You can redistribute it and/or modify it 
under the terms of the Do What The Fuck You Want To Public License, Version 2,
as published by Sam Hocevar. See http://www.wtfpl.net/ for more details.