This vector space retrieval model was implemented and evaluated on a small benchmark document collection from TREC. The collection was preprocessed with NLTK/StanfordNLP to tag entities such as Organisations, Locations and Persons. The dataset also includes a subset of the TREC topics (i.e., queries) and their relevance judgements (i.e., qrels) for these documents.
All three scripts (invidx.py, printdict.py and vecsearch.py) use the following standard Python libraries: os, string, math, pickle and xml.etree.ElementTree.
The programs are to be executed in the following order: invidx.py --> vecsearch.py. The printdict.py script prints the inverted index dictionary in a human-readable format. Each script prompts the user for input as follows:
- invidx.py => python invidx.py => provide 1st input as the path to the document collection folder (e.g., data/TaggedTrainingAP/) => provide 2nd input as the name of the index file
- vecsearch.py => python vecsearch.py => provide 1st input as the query file (e.g., data/topics.51-100) => provide 2nd input as the cut-off value k => provide 3rd input as the name of the result file => provide 4th input as the index file obtained from invidx.py => provide 5th input as the dict file obtained from invidx.py
- printdict.py => python printdict.py => provide 1st input as the dict file obtained from invidx.py
- Generates inverted index files
- Outputs two binary files:
  1. indexfile.dict - a dictionary whose keys are the unique words found across all documents and whose values are lists of the IDs of the documents containing that word
  2. indexfile.idx - a dictionary whose keys are the unique document IDs and whose values are dictionaries mapping each word in the document to its term frequency f
- The method invidx_cons in this file takes user input for the path of the directory containing the collection files and the name of the indexfile to be saved
- Upon running the program, the console will show "Enter collpath: "; enter the path of the folder containing the collection files here
- After the collection path has been entered, a second prompt will appear, "Enter indexfile: "; enter the path/name under which the indexfile should be saved here
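The two structures described above can be sketched as follows. This is only an illustration, not the code in invidx.py: the `tokenize`, `build_index` and `save_index` helpers are hypothetical names, the tokenization (lowercasing, stripping punctuation) is an assumption, and the documents are passed in as an in-memory mapping rather than parsed from the collection's XML files.

```python
import pickle
import string
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase, replace punctuation with spaces, split.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table).split()


def build_index(docs):
    """docs maps document ID -> raw text; returns (postings, doc_tf)."""
    postings = {}  # word -> list of IDs of the documents containing it
    doc_tf = {}    # document ID -> {word: term frequency f}
    for doc_id, text in docs.items():
        counts = Counter(tokenize(text))
        doc_tf[doc_id] = dict(counts)
        for word in counts:
            postings.setdefault(word, []).append(doc_id)
    return postings, doc_tf


def save_index(postings, doc_tf, indexfile):
    # Write the two binary files, <indexfile>.dict and <indexfile>.idx.
    with open(indexfile + ".dict", "wb") as f:
        pickle.dump(postings, f)
    with open(indexfile + ".idx", "wb") as f:
        pickle.dump(doc_tf, f)


# Toy collection with made-up document IDs.
docs = {
    "AP-1": "Oil prices rose. Oil exports fell.",
    "AP-2": "Prices fell again.",
}
postings, doc_tf = build_index(docs)
```

Both files can be read back with `pickle.load`, which is presumably how printdict.py and vecsearch.py consume them.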
- Prints the indexfile.dict file generated using invidx.py in a human-readable sorted form on the screen
- Parameter to this function is the binary file, indexfile.dict, generated using the invidx.py program
- Upon running the program, the console will show "Enter dictfile: "; enter the path to the indexfile.dict file generated using invidx.py here
- Output format on the screen (each line) -- "::"
- Implements the vector space retrieval model using the files indexfile.idx and indexfile.dict
- Upon executing this program, the user will have to provide 5 inputs:
  1. queryfile - the path to the query file; the prompt is "Enter queryfile: "
  2. k - the cut-off value k; the prompt is "Enter k value: "
  3. resultfile - the path/name of the result file; the prompt is "Enter resultfile: "
  4. indexfile.idx - the path to the indexfile.idx generated using invidx.py; the prompt is "Enter indexfile: "
  5. indexfile.dict - the path to the indexfile.dict generated using invidx.py; the prompt is "Enter dictfile: "
- Outputs a resultfile in which each line has the format "qid iter docno rank sim run_id"
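As a rough illustration of how the two index structures can drive a vector space ranking, the sketch below scores documents by cosine similarity over tf-idf weights and formats the top k as result lines. The weighting scheme (log-scaled tf times log idf), the function names, the `0` placed in the iter column and the `vecsearch` run id are all assumptions made for this example; the weighting actually used by vecsearch.py may differ.

```python
import math


def idf(word, postings, n_docs):
    # Inverse document frequency; 0 for words that occur in no document.
    df = len(postings.get(word, []))
    return math.log(n_docs / df) if df else 0.0


def weight_vector(tf, postings, n_docs):
    # Assumed weighting: (1 + ln f) * idf for each word with frequency f.
    return {w: (1 + math.log(f)) * idf(w, postings, n_docs) for w, f in tf.items()}


def cosine(q, d):
    dot = sum(wt * d.get(w, 0.0) for w, wt in q.items())
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    norm_d = math.sqrt(sum(v * v for v in d.values()))
    return dot / (norm_q * norm_d) if norm_q and norm_d else 0.0


def search(query_terms, postings, doc_tf, k):
    n_docs = len(doc_tf)
    q_tf = {}
    for w in query_terms:
        q_tf[w] = q_tf.get(w, 0) + 1
    q_vec = weight_vector(q_tf, postings, n_docs)
    # Only documents sharing at least one term with the query can score > 0.
    candidates = {d for w in q_tf for d in postings.get(w, [])}
    scored = [(cosine(q_vec, weight_vector(doc_tf[d], postings, n_docs)), d)
              for d in candidates]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best first, ties by ID
    return scored[:k]


def format_results(qid, ranked, run_id="vecsearch"):
    # One line per document: qid iter docno rank sim run_id
    return [f"{qid} 0 {doc} {rank} {sim:.4f} {run_id}"
            for rank, (sim, doc) in enumerate(ranked, start=1)]


# Toy index in the same shape as indexfile.dict and indexfile.idx.
postings = {"oil": ["D1"], "prices": ["D1", "D2"], "fell": ["D2"], "weather": ["D3"]}
doc_tf = {"D1": {"oil": 2, "prices": 1}, "D2": {"prices": 1, "fell": 1}, "D3": {"weather": 1}}
ranked = search(["oil", "prices"], postings, doc_tf, k=10)
lines = format_results(51, ranked)
```

Restricting scoring to documents that appear in the postings lists of the query terms avoids touching the rest of the collection, which matters for an index of this size.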
- Algorithmic details documentation
- The wall clock running time of the invidx.py program is 162.67 seconds
- The indexfile.dict contains 467793 unique words as keys and the indexfile.idx contains 81946 unique document IDs as keys
- The size of the indexfile.dict is 108MB and the size of indexfile.idx is 321MB
- The ndcg and F1 scores were evaluated using trec_eval
- The ndcg value obtained for k = 10 is 0.2322
- The F1 score obtained for k = 100 is 0.1527
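For readers unfamiliar with the two measures, the sketch below shows their common textbook definitions at a cut-off k. It is not the trec_eval implementation and was not used to produce the scores above; the function names and the toy ranking are hypothetical, and trec_eval may differ in details such as how judgements are aggregated over queries.

```python
import math


def f1_at_k(ranked, relevant, k):
    # Harmonic mean of precision and recall over the top-k retrieved documents.
    retrieved = ranked[:k]
    hits = sum(1 for doc in retrieved if doc in relevant)
    if hits == 0:
        return 0.0
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)


def ndcg_at_k(ranked, gains, k):
    # gains maps document ID -> relevance grade; unjudged documents count as 0.
    def dcg(values):
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(g / math.log2(i + 2) for i, g in enumerate(values))

    actual = dcg([gains.get(doc, 0) for doc in ranked[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal else 0.0


# Toy example: D1 and D2 are relevant, the system ranks D3 between them.
ranked = ["D1", "D3", "D2"]
f1 = f1_at_k(ranked, {"D1", "D2"}, k=2)
ndcg = ndcg_at_k(ranked, {"D1": 1, "D2": 1}, k=3)
```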