COL764 Information Retrieval and Web Search

Assignment 1 : Boolean Retrieval

Japneet Singh 2019MT10696

Description

A Boolean Information Retrieval System that supports the following operations:

- Single keyword retrieval : Returns all document-ids whose document content contains the specified keyword.
- Multi-keyword retrieval : Returns all document-ids whose document content contains all the specified keywords.
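
The two retrieval modes can be illustrated with a small sketch. It assumes postings lists are sorted lists of integer document-ids held in a plain dict; the names `intersect`, `retrieve` and the toy `index` are hypothetical and are not taken from query.py.

```python
def intersect(a, b):
    """Merge-intersect two sorted lists of document-ids."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def retrieve(index, keywords):
    """Return the ids of documents that contain every keyword."""
    lists = [index.get(k, []) for k in keywords]
    if not lists:
        return []
    # Start from the shortest list so intermediate results stay small.
    lists.sort(key=len)
    result = list(lists[0])
    for postings in lists[1:]:
        result = intersect(result, postings)
        if not result:
            break
    return result


# Toy index (assumed data): term -> sorted document-ids.
index = {"retriev": [1, 3, 5, 8], "boolean": [3, 4, 8], "index": [2, 3, 8, 9]}
print(retrieve(index, ["boolean"]))                      # single keyword
print(retrieve(index, ["retriev", "boolean", "index"]))  # multi-keyword (AND)
```

A single-keyword query is just the one-list case of the same function, and a keyword missing from the index yields an empty result.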

The system inverts the document collection and builds an on-disk inverted index, consisting of a dictionary file and a single file of all postings lists. It supports supplying a stop-word list; if the list is empty, no stop-words are eliminated. Terms are stemmed before indexing with the Porter Stemmer (from https://tartarus.org/martin/PorterStemmer). Besides the standard-library modules re and os, the system uses python-snappy and beautifulsoup4.
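
As a rough illustration of the inversion step, the sketch below tokenizes each document, drops stop-words, stems the remaining tokens and collects a term -> document-ids mapping in memory. The tokenization regex, the function names and the identity `stem` default are assumptions made for this example; the actual system stems with PorterStemmer.py and may tokenize differently.

```python
import re
from collections import defaultdict


def tokenize(text, stopwords=frozenset(), stem=lambda t: t):
    """Lowercase, split on non-alphanumerics, drop stop-words, then stem.

    `stem` defaults to the identity function as a stand-in for the
    Porter stemmer (assumption made to keep the sketch self-contained).
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in stopwords]


def build_index(docs, stopwords=frozenset(), stem=lambda t: t):
    """Invert {doc_id: text} into {term: sorted list of doc_ids}."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        # set() ensures each document appears once per term.
        for term in set(tokenize(docs[doc_id], stopwords, stem)):
            index[term].append(doc_id)
    return dict(index)


docs = {"D1": "The cat sat", "D2": "The dog sat down"}
with_stopwords = build_index(docs, stopwords={"the"})
no_stopwords = build_index(docs)  # empty list: nothing is eliminated
print(with_stopwords["sat"], "the" in with_stopwords, "the" in no_stopwords)
```

Passing an empty stop-word set mirrors the behaviour described above: every token is kept.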

Requirements

The following packages are required for the system to function:

PorterStemmer
- Included in the directory in PorterStemmer.py
Snappy
- Python distribution available at github.com/andrix/python-snappy
- To install using PyPi, run pip install python-snappy
BeautifulSoup (bs4)
- To install using PyPi, run pip install beautifulsoup4

Included Files and Running the Code

The submission contains two main Python files:

invidx_cons.py - Reads the input data, preprocesses the relevant text and creates an on-disk inverted index with the required compression scheme.
To run

python3 invidx_cons.py <coll-name> <indexfile> <stopword-file> <compression> <xml-tags-info>

<coll-name>      Directory Name or Path for the directory containing all the files in the corpus. 
                 The program expects a file to have multiple xml fragments with the root tag as <DOCNO>
<indexfile>      The name of the inverted index and the dictionary files to be generated.
                 If the argument given is `invidx` then the program will generate two output files
                 - invidx.dict : The term -> (offset, list length) mapping of the data
                 - invidx.idx : The token -> document list mapping of the data
<stopword-file>  The name of the file containing the list of stopwords not to be indexed.
                 Each line of the file contains one stopword.
<compression>    Compression to be used, denoted by a number in {0,1,2,3,4,5}
                 The detailed compression algorithm followed corresponding to each 
                 compression number is given in 2019MT10696.pdf
<xml-tags-info>  List of relevant xml tags containing the text to be indexed.
                 According to assignment specifications, first line is always DOCNO
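
The split between the dictionary file and the postings file can be sketched as follows. This is only an illustration of the term -> (offset, list length) idea: it assumes integer document-ids stored as uncompressed 4-byte little-endian values, uses an in-memory buffer in place of invidx.idx, and the function names are hypothetical. The real byte layout depends on the chosen compression scheme and is described in 2019MT10696.pdf.

```python
import io
import struct


def write_postings(index, out):
    """Write all postings lists to `out`; return term -> (offset, length)."""
    dictionary = {}
    for term in sorted(index):
        postings = index[term]
        # Remember where this list starts and how many ids it holds.
        dictionary[term] = (out.tell(), len(postings))
        out.write(struct.pack(f"<{len(postings)}I", *postings))
    return dictionary


def read_postings(term, dictionary, f):
    """Seek straight to one term's postings list and decode it."""
    if term not in dictionary:
        return []
    offset, length = dictionary[term]
    f.seek(offset)
    return list(struct.unpack(f"<{length}I", f.read(4 * length)))


buf = io.BytesIO()  # stands in for the .idx file
dictionary = write_postings({"cat": [1, 4], "dog": [2, 4, 7]}, buf)
print(dictionary)
print(read_postings("dog", dictionary, buf))
```

Keeping only the dictionary in memory and seeking into the postings file means a query reads just the lists for its own terms.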

query.py - Reads the given query file and retrieves the details of the most relevant documents using the attached index and dictionary.
To run

python3 query.py <queryfile> <resultfile> <indexfile> <dictfile>

<queryfile>      A file containing keyword queries, with each line corresponding to a query.
                 Multi-term queries are space-separated words on the same line.
<resultfile>     The name of the file to be generated with the results for the query.
                 Sample format is:
                 ...
                 Q0 ZF08-175-870 1.0
                 Q0 ZF08-175-871 1.0
                 Q0 ZF08-175-872 1.0
                 ...
<indexfile>      The index file generated by invidx_cons above to be used for evaluating the queries
<dictfile>       The dictionary file generated by invidx_cons to be used for evaluating the queries
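
A minimal sketch of producing lines in the sample result format shown above. It assumes the `Q<n>` prefix is the zero-based position of the query in the query file and that the final column is a constant score of 1.0, which is what the sample suggests; `format_results` is a hypothetical helper, not a function from query.py.

```python
def format_results(query_no, doc_ids):
    """Return one 'Q<n> <doc-id> 1.0' line per matching document."""
    return [f"Q{query_no} {doc_id} 1.0" for doc_id in doc_ids]


# Example with the document-ids from the sample (query number assumed).
lines = format_results(0, ["ZF08-175-870", "ZF08-175-871"])
print("\n".join(lines))
```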

Additionally, there are shell wrappers included to run these files

invidx.sh - Wrapper for invidx_cons.py

bash invidx.sh <coll-name> <indexfile> <stopword-file> <compression> <xml-tags-info>

Argument details same as for invidx_cons.py

boolsearch.sh - Wrapper for running query.py

bash boolsearch.sh <queryfile> <resultfile> <indexfile> <dictfile>

Argument details same as for query.py