
Assignment 1 for course COL764 Information Retrieval and Web Search


COL764 Information Retrieval and Web Search

Assignment 1 : Boolean Retrieval

Japneet Singh 2019MT10696

Description

A Boolean Information Retrieval System that supports the following operations:

  • Single keyword retrieval : Returns all document-ids whose document content contains the specified keyword.
  • Multi-keyword retrieval : Returns all document-ids whose document content contains all the specified keywords.
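
Multi-keyword retrieval amounts to intersecting the postings lists of the query terms. The sketch below illustrates one common way to do that, a linear merge over sorted lists. It is not taken from query.py: the names `intersect`, `boolean_and` and `toy_index` are hypothetical, and the index is assumed to be an in-memory dict mapping each term to a sorted list of integer document ids.

```python
def intersect(p1, p2):
    """Merge-intersect two sorted lists of document ids."""
    i, j, out = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out


def boolean_and(index, terms):
    """Return ids of documents that contain every term in `terms`."""
    # A term that is absent from the index has an empty postings list.
    postings = [index.get(term, []) for term in terms]
    if not postings:
        return []
    # Process the shortest lists first so intermediate results stay small.
    postings.sort(key=len)
    result = postings[0]
    for other in postings[1:]:
        result = intersect(result, other)
        if not result:
            break
    return result


# Hypothetical toy index: term -> sorted list of integer document ids.
toy_index = {"boolean": [1, 2, 4, 7], "retrieval": [2, 4, 5], "index": [4, 7]}
```

With a single term, `boolean_and` simply returns that term's postings list, which covers the single keyword case as well.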

The system inverts the document collection and builds an on-disk inverted index, consisting of a dictionary file and a single file holding all postings lists. It accepts a stop-word list; if the list is empty, no stop-words are eliminated. Terms are stemmed before indexing with the Porter Stemmer (from https://tartarus.org/martin/PorterStemmer). Besides the native libraries re and os, the code uses python-snappy and beautifulsoup4.
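
To make the dictionary/postings split concrete, here is a minimal sketch of how such an index could be laid out. It is an illustration under stated assumptions, not the code in invidx_cons.py: document ids are assumed to be integers, postings are stored uncompressed as 4-byte values, the tokenizer is a simple regular expression, `stem` is an identity placeholder standing in for the Porter Stemmer, and an in-memory buffer stands in for the postings file. All function names are hypothetical.

```python
import io
import re
import struct
from collections import defaultdict


def build_index(docs, stopwords=frozenset(), stem=lambda t: t):
    """Map each term to a sorted list of integer document ids.

    `docs` is a list of (doc_id, text) pairs; `stem` is a placeholder
    for a real stemmer such as the Porter Stemmer.
    """
    index = defaultdict(set)
    for doc_id, text in docs:
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            if token in stopwords:  # an empty stop-word list removes nothing
                continue
            index[stem(token)].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def write_postings(index, out):
    """Write all postings lists to `out`; return term -> (offset, length)."""
    dictionary = {}
    for term in sorted(index):
        ids = index[term]
        dictionary[term] = (out.tell(), len(ids))
        # Uncompressed 4-byte little-endian ids: an assumption of this sketch.
        out.write(struct.pack(f"<{len(ids)}I", *ids))
    return dictionary


def read_postings(dictionary, src, term):
    """Seek to a term's postings list and decode it."""
    if term not in dictionary:
        return []
    offset, length = dictionary[term]
    src.seek(offset)
    return list(struct.unpack(f"<{length}I", src.read(4 * length)))


docs = [(1, "The cat sat"), (2, "The dog sat"), (3, "A cat ran")]
index = build_index(docs, stopwords={"the", "a"})
postings_file = io.BytesIO()  # stands in for the on-disk postings file
dictionary = write_postings(index, postings_file)
```

Because the dictionary stores only an offset and a length per term, a query needs to read just the postings lists of its own terms rather than the whole postings file.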

Requirements

The following packages are required for the system to function:

  1. PorterStemmer
    • Included in the directory in PorterStemmer.py
  2. Snappy
  3. BeautifulSoup (bs4)
    • To install from PyPI, run pip install beautifulsoup4

Included Files and Running the Code

The submission contains two main Python files:

  1. invidx_cons.py - Reads the input data, preprocesses the relevant text, and creates an on-disk inverted index with the requested compression scheme.
    To run
python3 invidx_cons.py <coll-name> <indexfile> <stopword-file> <compression> <xml-tags-info> 
<coll-name>      Directory Name or Path for the directory containing all the files in the corpus. 
                 The program expects a file to have multiple xml fragments with the root tag as <DOCNO>
<indexfile>      The name of the inverted index and the dictionary files to be generated.
                 If the argument given is `invidx` then the program will generate two output files
                 - invidx.dict : The term -> (offset, list length) mapping of the data
                 - invidx.idx : The token -> document list mapping of the data
<stopword-file>  The name of the file containing the list of stopwords not to be indexed.
                 Each line of the file contains one stopword each.
<compression>    Compression to be used, denoted by a number in {0,1,2,3,4,5}
                 The detailed compression algorithm followed corresponding to each 
                 compression number is given in 2019MT10696.pdf
<xml-tags-info>  List of relevant xml tags containing the text to be indexed.
                 According to assignment specifications, first line is always DOCNO
  2. query.py - Reads the given query file and retrieves the details of the most relevant documents using the attached index and dictionary.
    To run
python3 query.py <queryfile> <resultfile> <indexfile> <dictfile>
<queryfile>      A file containing keyword queries, with each line corresponding to a query. 
                 Multi-term queries are space-separated words on the same line.
<resultfile>     The name of the file to be generated with the results for the query.
                 Sample format is:
                 ...
                 Q0 ZF08-175-870 1.0
                 Q0 ZF08-175-871 1.0
                 Q0 ZF08-175-872 1.0
                 ...
<indexfile>      The index file generated by invidx_cons above to be used for evaluating the queries
<dictfile>       The dictionary file generated by invidx_cons to be used for evaluating the queries
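
The query and result formats described above can be sketched as follows. This is only an illustration: `read_queries` and `format_results` are hypothetical names, and it assumes that the `Q0` prefix is the zero-based line number of the query and that every retrieved document gets the constant score 1.0, as in the sample. The document ids in `hypothetical_hits` are made-up lookup results, not output of the real system.

```python
def read_queries(lines):
    """Split each query line into its space-separated keywords."""
    return [line.split() for line in lines]


def format_results(query_number, doc_ids):
    """Render one result line per retrieved document."""
    # Assumption: the prefix is 'Q' plus the zero-based query number.
    return [f"Q{query_number} {doc_id} 1.0" for doc_id in doc_ids]


query_lines = ["information retrieval\n", "index\n"]
queries = read_queries(query_lines)

# Made-up retrieval results, keyed by query number.
hypothetical_hits = {0: ["ZF08-175-870", "ZF08-175-871"], 1: []}

output = []
for number, keywords in enumerate(queries):
    output.extend(format_results(number, hypothetical_hits[number]))
```

A query with no matching documents contributes no lines to the result file in this sketch.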

Additionally, shell wrappers are included to run these files:

  1. invidx.sh - Wrapper for invidx_cons.py
bash invidx.sh <coll-name> <indexfile> <stopword-file> <compression> <xml-tags-info>

Argument details are the same as for invidx_cons.py
  2. boolsearch.sh - Wrapper for query.py
bash boolsearch.sh <queryfile> <resultfile> <indexfile> <dictfile>

Argument details are the same as for query.py
  3. build.sh - An empty file, as no building or compiling is required for Python files.