A Boolean Information Retrieval System that supports the following operations:
- Single keyword retrieval: Returns all document-ids whose document content contains the specified keyword.
- Multi-keyword retrieval: Returns all document-ids whose document content contains all the specified keywords.
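As an illustration of the two operations above, multi-keyword retrieval can be viewed as intersecting the postings lists of the individual keywords. The following is a minimal sketch, not the code in query.py: the function names `intersect` and `retrieve` are hypothetical, and the index is assumed to be an in-memory dict mapping each term to a sorted list of document-ids.

```python
def intersect(p1, p2):
    """Return the sorted intersection of two sorted lists of document-ids."""
    i, j, out = 0, 0, []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out


def retrieve(index, keywords):
    """Return the document-ids that contain every keyword (Boolean AND).

    `index` is assumed to map term -> sorted list of document-ids.
    A single keyword simply returns its own postings list.
    """
    postings = [index.get(k, []) for k in keywords]
    if not postings:
        return []
    # Start from the shortest list so intermediate results stay small.
    postings.sort(key=len)
    result = list(postings[0])
    for p in postings[1:]:
        result = intersect(result, p)
        if not result:
            break
    return result
```

For example, with `index = {"apple": [1, 2, 5], "pie": [2, 3, 5]}`, the call `retrieve(index, ["apple", "pie"])` keeps only the document-ids present in both lists.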
Inverts the document collection and builds an on-disk inverted index, consisting of a dictionary file and a single file of all postings lists. Supports supplying a stop-word list; if the list is empty, no stop-words are eliminated.
Uses the Porter Stemmer (from https://tartarus.org/martin/PorterStemmer) to stem terms before indexing. Uses python-snappy and beautifulsoup4 in addition to the native libraries re and os.
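The stop-word and stemming behaviour described above can be sketched as a small normalization step. This is an assumption-laden illustration rather than the preprocessing in invidx_cons.py: the function name `normalize`, the tokenization rule (lowercased alphanumeric runs), and the order of stop-word removal before stemming are all hypothetical. The `stem` parameter stands in for the bundled Porter stemmer and defaults to the identity function so the sketch runs on its own.

```python
import re


def normalize(text, stopwords=frozenset(), stem=lambda word: word):
    """Tokenize `text`, drop stop-words, then stem the remaining tokens.

    An empty `stopwords` collection means nothing is eliminated.
    `stem` is a placeholder for a real stemmer (assumed callable on one word).
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(tok) for tok in tokens if tok not in stopwords]
```

Calling `normalize("The cats and the dog", {"the", "and"})` drops the two stop-words and leaves the other tokens unchanged, since no stemmer is supplied.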
The following packages are required for the system to function:
- PorterStemmer
  - Included in the directory in PorterStemmer.py
- Snappy
  - Python distribution available at github.com/andrix/python-snappy
  - To install from PyPI, run
    pip install python-snappy
- BeautifulSoup (bs4)
  - To install from PyPI, run
    pip install beautifulsoup4
The submission contains 2 main python files -
invidx_cons.py
- Reads input data, preprocesses the relevant text and creates an on-disk inverted index with the required compression scheme.
To run
python3 invidx_cons.py <coll-name> <indexfile> <stopword-file> <compression> <xml-tags-info>
<coll-name> Directory Name or Path for the directory containing all the files in the corpus.
The program expects a file to have multiple xml fragments with the root tag as <DOCNO>
<indexfile> The name of the inverted index and the dictionary files to be generated.
If the argument given is `invidx` then the program will generate two output files
- invidx.dict : The term -> (offset, list length) mapping of the data
- invidx.idx : The term -> document list mapping of the data
<stopword-file> The name of the file containing the list of stopwords not to be indexed.
Each line of the file contains one stopword.
<compression> Compression to be used, denoted by a number in {0,1,2,3,4,5}
The compression algorithm corresponding to each
compression number is detailed in 2019MT10696.pdf
<xml-tags-info> List of relevant xml tags containing the text to be indexed.
According to the assignment specifications, the first line is always DOCNO
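To make the relationship between the two output files concrete, here is a minimal sketch of a dictionary file that stores a term -> (offset, list length) mapping into a single postings file. It is not the format written by invidx_cons.py: the function names, the tab-separated text dictionary, the uncompressed 4-byte little-endian integers, the interpretation of "list length" as a number of entries, and the use of integer document numbers are all assumptions made for illustration.

```python
import struct


def write_index(postings, dict_path, idx_path):
    """Write all postings lists to one file and record where each one starts.

    `postings` is assumed to map term -> sorted list of integer doc numbers.
    """
    with open(idx_path, "wb") as idx, open(dict_path, "w", encoding="utf-8") as dct:
        for term in sorted(postings):
            docs = postings[term]
            offset = idx.tell()  # byte position where this list begins
            idx.write(struct.pack(f"<{len(docs)}I", *docs))
            dct.write(f"{term}\t{offset}\t{len(docs)}\n")


def load_dictionary(dict_path):
    """Load the term -> (offset, list length) mapping into memory."""
    dictionary = {}
    with open(dict_path, encoding="utf-8") as f:
        for line in f:
            term, offset, length = line.rstrip("\n").split("\t")
            dictionary[term] = (int(offset), int(length))
    return dictionary


def read_postings(idx_path, dictionary, term):
    """Seek straight to one postings list without reading the whole file."""
    if term not in dictionary:
        return []
    offset, length = dictionary[term]
    with open(idx_path, "rb") as idx:
        idx.seek(offset)
        data = idx.read(4 * length)
    return list(struct.unpack(f"<{length}I", data))
```

The point of the layout is that only the small dictionary needs to be held in memory at query time; each postings list is fetched with a single seek and read.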
query.py
- Reads the given query file and retrieves the details of the most relevant documents using the attached index and dictionary.
To run
python3 query.py <queryfile> <resultfile> <indexfile> <dictfile>
<queryfile> A file containing keyword queries, with each line corresponding to a query.
Multi-term queries are space-separated words on the same line.
<resultfile> The name of the file to be generated with the results for the query.
Sample format is:
...
Q0 ZF08-175-870 1.0
Q0 ZF08-175-871 1.0
Q0 ZF08-175-872 1.0
...
<indexfile> The index file generated by invidx_cons above to be used for evaluating the queries
<dictfile> The dictionary file generated by invidx_cons to be used for evaluating the queries
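The query and result file formats described above can be sketched as follows. This is an illustration under stated assumptions, not the code in query.py: `run_queries` and the `search` callable are hypothetical names, queries are assumed to be numbered from 0 in file order (the sample only shows Q0), and every match is written with the constant score 1.0 seen in the sample.

```python
def run_queries(query_path, result_path, search):
    """Evaluate one query per line and write `Q<n> <doc-id> 1.0` result lines.

    `search` is a placeholder: any callable that takes a list of keywords
    and returns an iterable of matching document-ids.
    """
    with open(query_path, encoding="utf-8") as qf, \
            open(result_path, "w", encoding="utf-8") as rf:
        for qnum, line in enumerate(qf):
            keywords = line.split()  # multi-term queries are space-separated
            if not keywords:
                continue  # assumption: blank lines produce no results
            for doc_id in search(keywords):
                rf.write(f"Q{qnum} {doc_id} 1.0\n")
```

In a full pipeline, `search` would look up each keyword in the dictionary file, read the matching postings lists from the index file, and intersect them.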
Additionally, there are shell wrappers included to run these files
invidx.sh
- Wrapper for invidx_cons.py
bash invidx.sh <coll-name> <indexfile> <stopword-file> <compression> <xml-tags-info>
Argument details same as for invidx_cons.py
boolsearch.sh
- Wrapper for running query.py
bash boolsearch.sh <queryfile> <resultfile> <indexfile> <dictfile>
Argument details same as for query.py
build.sh
- An empty file, as no building or compiling is required for Python files.