A search engine implemented in Python using the Vector Space Model and Rocchio relevance feedback. This implementation achieved a MAP of 0.74349 on the given Chinese news corpus.
- file-list : Each line is the file path of one document.
- vocab.all : The first line is the encoding of this file, followed by the terms in this collection, one per line. The index of each term is its row number - 1, so the encoding line takes index 0 and the first term has index 1.
utf-8
apple
bannan
cat
- inverted-file : vocab_id and file_id refer to entries in vocab.all and file-list. Each term starts with a header line of the form vocab_id_1 vocab_id_2 N. The term is a unigram when vocab_id_2 == -1 and a bigram when vocab_id_2 != -1, and N is the number of files containing it. The header is followed by N lines that give the count of this term in each of those files.
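The two index files above could be loaded roughly as follows. This is a minimal sketch, not the repository's actual loader: the function names are hypothetical, the sample postings are made up, and it assumes that each of the N lines after a header has the form "file_id count", which the description above does not spell out.

```python
import io

def load_vocab(lines):
    """Map each term to its index (row number - 1; the encoding line is index 0)."""
    rows = [line.rstrip("\n") for line in lines]
    # rows[0] is the encoding declaration (e.g. "utf-8"), so terms start at index 1.
    return {term: idx for idx, term in enumerate(rows) if idx > 0}

def load_inverted_file(lines):
    """Parse postings into {(vocab_id_1, vocab_id_2): {file_id: count}}.

    Assumption: every posting line is "file_id count".
    """
    postings = {}
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines between entries
        vocab_id_1, vocab_id_2, n_files = map(int, header.split())
        docs = {}
        for _ in range(n_files):
            file_id, count = map(int, next(it).split())
            docs[file_id] = count
        postings[(vocab_id_1, vocab_id_2)] = docs
    return postings

# Small in-memory stand-ins for vocab.all and inverted-file (made-up data).
vocab = load_vocab(io.StringIO("utf-8\napple\nbannan\ncat\n"))
postings = load_inverted_file(io.StringIO("1 -1 2\n0 3\n2 1\n1 3 1\n2 1\n"))
```

With this layout, a key whose second element is -1 is a unigram, and any other key is a bigram.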
The input file is a query.xml in the following format.
<number> : The topic number.
<title>: The topic title.
<question>: A short description about the query topic.
<narrative>: A more detailed description of the topic.
<concepts>: A set of keywords about the topic that can be used in retrieval.
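A query file with these tags could be read with the standard library's xml.etree.ElementTree. The sketch below is illustrative only: parse_queries is a hypothetical name, and the <topics> and <topic> wrapper elements in the sample are assumptions, since only the five inner tags are documented here. To avoid depending on the wrapper names, the code treats any element with a <number> child as one topic.

```python
import xml.etree.ElementTree as ET

FIELDS = ("number", "title", "question", "narrative", "concepts")

def parse_queries(xml_text):
    """Return one dict per topic, keyed by the five documented tag names."""
    root = ET.fromstring(xml_text)
    queries = []
    for elem in root.iter():
        # Only elements that directly contain <number> are treated as topics.
        if elem.find("number") is None:
            continue
        queries.append({f: (elem.findtext(f) or "").strip() for f in FIELDS})
    return queries

# Made-up sample; the wrapper element names are assumptions.
sample = """<topics>
  <topic>
    <number>011</number>
    <title>example title</title>
    <question>example question</question>
    <narrative>example narrative</narrative>
    <concepts>keyword1, keyword2</concepts>
  </topic>
</topics>"""

queries = parse_queries(sample)
```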
The output file is a result.csv in which every line is query_id,retrieved_docs, like
011,doc1,doc3,doc5
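Writing that format takes only a few lines. In this sketch, write_results is a hypothetical helper, and it assumes the file has no header row, since none is mentioned above.

```python
import io

def write_results(out, rankings):
    """Write one line per query: the query id, then its ranked document ids."""
    for query_id, docs in rankings:
        out.write(",".join([query_id, *docs]) + "\n")

# Write to an in-memory buffer; a real run would pass an open file instead.
buf = io.StringIO()
write_results(buf, [("011", ["doc1", "doc3", "doc5"])])
output_text = buf.getvalue()
```

Keeping the query id as a string preserves leading zeros such as the one in 011.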
execute.sh [-r] -i query-file -o output-file -m model-dir -d corpus-path
-r
If specified, turn on relevance feedback.
-i
The query file, in the format described above.
-o
The output ranked list.
-m
A directory that contains vocab.all, inverted-file, and file-list.
-d
The absolute path to the corpus directory.
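If execute.sh forwards its arguments to a Python entry point, the flags above could be parsed with argparse along these lines. This is a sketch under that assumption: build_parser and the dest names are hypothetical, and the paths in the example call are placeholders.

```python
import argparse

def build_parser():
    """Mirror the execute.sh options: [-r] -i -o -m -d."""
    parser = argparse.ArgumentParser(prog="execute.sh")
    parser.add_argument("-r", dest="feedback", action="store_true",
                        help="turn on relevance feedback")
    parser.add_argument("-i", dest="query_file", required=True,
                        help="query file in the format above")
    parser.add_argument("-o", dest="output_file", required=True,
                        help="output ranked list")
    parser.add_argument("-m", dest="model_dir", required=True,
                        help="directory with vocab.all, inverted-file, file-list")
    parser.add_argument("-d", dest="corpus_path", required=True,
                        help="absolute path to the corpus directory")
    return parser

# Example invocation with placeholder paths.
args = build_parser().parse_args(
    ["-r", "-i", "query.xml", "-o", "result.csv", "-m", "model", "-d", "/path/to/corpus"]
)
```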