Vector-Space-Retrieval(VSM)-Model

This repository implements a vector space retrieval model and evaluates it on a small benchmark document collection from TREC. The documents were preprocessed using NLTK/StanfordNLP to tag entities such as organisations, locations and persons. The dataset also contains a subset of the TREC topics (i.e., queries) and their relevance judgements (i.e., qrels) on these documents.

All three scripts (invidx.py, printdict.py and vecsearch.py) use the following standard Python libraries: os, string, math, pickle and xml.etree.ElementTree.

Usage

The programs are to be executed in the following order: invidx.py, then vecsearch.py. The printdict.py file prints the inverted index dictionary in a human-readable format. Each script prompts the user for input in the following manner -

  • invidx.py => python invidx.py => provide 1st input as the path to the document collection folder (e.g. data/TaggedTrainingAP/) => provide 2nd input as the name of the index file
  • vecsearch.py => python vecsearch.py => provide 1st input as the query file (e.g. data/topics.51-100) => provide 2nd input as the cut-off value k => provide 3rd input as the name of the result file => provide 4th input as the index file obtained from invidx.py => provide 5th input as the dict file obtained from invidx.py
  • printdict.py => python printdict.py => provide 1st input as the dict file obtained from invidx.py

invidx.py

  • Generates inverted index files
  • Outputs two binary files:
      1.) indexfile.dict - Contains a dictionary whose keys are the unique words present across all the documents and whose values are lists of the IDs of the documents in which the key occurs
      2.) indexfile.idx - Contains a dictionary whose keys are the unique document IDs and whose values are dictionaries (mapping each word present in the document to its term frequency f)
  • The method invidx_cons inside this file takes user input for the path of the directory containing the collection files and the name of the indexfile to be saved
  • Upon running the program, the console will show "Enter collpath: "; enter the path of the folder containing the collection files here
  • After the collection path has been entered, a second prompt will appear, "Enter indexfile: "; enter the path/name of the indexfile to be saved here
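The two structures described above can be sketched as follows. This is a minimal illustration, not the code in invidx.py: the function names build_index and save_index, the in-memory docs mapping, and the lowercase/strip-punctuation tokenisation are all assumptions made for the example, and the real script reads the tagged collection files from disk.

```python
from collections import Counter
import pickle
import string


def build_index(docs):
    """Build both index structures from a {doc_id: text} mapping (hypothetical input)."""
    strip_punct = str.maketrans("", "", string.punctuation)
    postings = {}   # word -> list of document IDs containing it (the .dict structure)
    doc_terms = {}  # document ID -> {word: term frequency} (the .idx structure)
    for doc_id, text in docs.items():
        # Assumed tokenisation: lowercase, drop punctuation, split on whitespace.
        tokens = text.lower().translate(strip_punct).split()
        counts = Counter(tokens)
        doc_terms[doc_id] = dict(counts)
        for word in counts:
            postings.setdefault(word, []).append(doc_id)
    return postings, doc_terms


def save_index(postings, doc_terms, indexfile):
    """Pickle the two dictionaries to <indexfile>.dict and <indexfile>.idx."""
    with open(indexfile + ".dict", "wb") as f:
        pickle.dump(postings, f)
    with open(indexfile + ".idx", "wb") as f:
        pickle.dump(doc_terms, f)


# Two made-up documents, for illustration only.
docs = {"AP-1": "Oil prices rise.", "AP-2": "Oil exports fall; prices fall."}
postings, doc_terms = build_index(docs)
print(postings["oil"])    # document IDs containing "oil"
print(doc_terms["AP-2"])  # term frequencies for AP-2
```

Keeping the posting lists and the per-document term frequencies in separate dictionaries mirrors the two output files: the first answers "which documents contain this word", the second supplies the frequencies needed for weighting.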

printdict.py

  • Prints the indexfile.dict file generated using invidx.py in a human-readable sorted form on the screen
  • Parameter to this function is the binary file, indexfile.dict, generated using the invidx.py program
  • Upon running the program, the console will show "Enter dictfile: "; enter the path to the indexfile.dict file generated using invidx.py here
  • Output format on the screen (each line) -- "::"
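Loading the pickled dictionary and listing it in sorted order might look like the sketch below. The helper names load_postings and format_postings are hypothetical, and the tab-separated line layout is only a placeholder chosen for the example; it is not the output format of printdict.py.

```python
import os
import pickle
import tempfile


def load_postings(dictfile):
    """Read the pickled word -> document-ID-list dictionary."""
    with open(dictfile, "rb") as f:
        return pickle.load(f)


def format_postings(postings):
    """One line per word, sorted alphabetically (placeholder layout: word<TAB>ids)."""
    return [f"{word}\t{','.join(sorted(postings[word]))}" for word in sorted(postings)]


# Write a tiny made-up dictionary to a temporary file, then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "indexfile.dict")
    with open(path, "wb") as f:
        pickle.dump({"prices": ["AP-2", "AP-1"], "oil": ["AP-1"]}, f)
    lines = format_postings(load_postings(path))

print("\n".join(lines))
```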

vecsearch.py

  • Implements the vector space retrieval model using the files indexfile.idx and indexfile.dict
  • Upon executing this program, the user has to provide 5 inputs:
      1.) queryfile - path to the query file; the prompt for this is "Enter queryfile: "
      2.) k - the cut-off value k; the prompt for this is "Enter k value: "
      3.) resultfile - path/name of the result file; the prompt for this is "Enter resultfile: "
      4.) indexfile.idx - path to the indexfile.idx generated using invidx.py; the prompt for this is "Enter indexfile: "
      5.) indexfile.dict - path to the indexfile.dict generated using invidx.py; the prompt for this is "Enter dictfile: "
  • Outputs a resultfile in the format (each line) -- "qid iter docno rank sim run_id"
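A rough sketch of how the two index structures can drive vector space ranking is given below. The weighting scheme is an assumption: this README does not state which tf-idf variant vecsearch.py uses (see Report.pdf), so the example picks one common choice, (1 + ln tf) * ln(1 + N/df) with cosine similarity. The names rank and result_lines, the run ID "vsm" and the constant 0 in the iter column are likewise hypothetical.

```python
import math


def rank(query_terms, postings, doc_terms, k):
    """Return the top-k (similarity, doc_id) pairs for a list of query terms."""
    n_docs = len(doc_terms)
    # Assumed idf variant: ln(1 + N / df), always positive.
    idf = {w: math.log(1 + n_docs / len(ids)) for w, ids in postings.items()}

    def weights(tf_map):
        # Assumed tf variant: 1 + ln(tf); words outside the index are ignored.
        return {w: (1 + math.log(tf)) * idf[w] for w, tf in tf_map.items() if w in idf}

    q_tf = {}
    for w in query_terms:
        q_tf[w] = q_tf.get(w, 0) + 1
    q_vec = weights(q_tf)
    q_norm = math.sqrt(sum(v * v for v in q_vec.values()))

    # Only documents sharing at least one term with the query can score above zero.
    candidates = {d for w in q_vec for d in postings[w]}
    scored = []
    for d in candidates:
        d_vec = weights(doc_terms[d])
        d_norm = math.sqrt(sum(v * v for v in d_vec.values()))
        dot = sum(q_vec[w] * d_vec.get(w, 0.0) for w in q_vec)
        scored.append((dot / (q_norm * d_norm), d))
    scored.sort(key=lambda s: (-s[0], s[1]))  # highest similarity first
    return scored[:k]


def result_lines(qid, ranked, run_id="vsm"):
    """Format ranked results as 'qid iter docno rank sim run_id' lines."""
    return [f"{qid} 0 {doc} {i} {sim:.4f} {run_id}"
            for i, (sim, doc) in enumerate(ranked, start=1)]


# Tiny made-up index, for illustration only.
postings = {"oil": ["AP-1", "AP-2"], "prices": ["AP-1", "AP-2"],
            "rise": ["AP-1"], "exports": ["AP-2"], "fall": ["AP-2"]}
doc_terms = {"AP-1": {"oil": 1, "prices": 1, "rise": 1},
             "AP-2": {"oil": 1, "exports": 1, "prices": 1, "fall": 2}}
ranked = rank(["oil", "rise"], postings, doc_terms, k=10)
lines = result_lines(51, ranked)
print("\n".join(lines))
```

Restricting scoring to documents found through the posting lists avoids touching every document in the collection for each query.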

Report.pdf

  • Algorithmic details documentation

Results

  • The wall clock running time of the invidx.py program is 162.67 seconds
  • The indexfile.dict contains 467793 unique words as keys and the indexfile.idx contains 81946 unique document IDs as keys
  • The size of the indexfile.dict is 108MB and the size of indexfile.idx is 321MB
  • The ndcg and F1 scores were evaluated using trec_eval
  • The ndcg value obtained for k = 10 is 0.2322
  • The F1 score obtained for k = 100 is 0.1527
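For readers unfamiliar with the two measures, the sketch below shows one common formulation of nDCG at cut-off k and of F1 over the top k results. It is an illustration only and is not the trec_eval implementation, whose exact definitions and options may differ; the function names and the toy ranking and judgements are made up for the example.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked, qrels, k):
    """nDCG@k: DCG of the ranking divided by the DCG of an ideal ranking."""
    gains = [qrels.get(d, 0) for d in ranked[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def f1_at_k(ranked, qrels, k):
    """Harmonic mean of precision and recall over the top-k results."""
    retrieved = ranked[:k]
    relevant = {d for d, rel in qrels.items() if rel > 0}
    hits = sum(1 for d in retrieved if d in relevant)
    if hits == 0:
        return 0.0
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)


# Toy ranking and relevance judgements, for illustration only.
ranked = ["d1", "d2", "d3"]
qrels = {"d1": 1, "d3": 1, "d4": 1}
ndcg = ndcg_at_k(ranked, qrels, k=3)
f1 = f1_at_k(ranked, qrels, k=3)
print(round(ndcg, 4), round(f1, 4))
```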