WebIR-VSM

A search engine implemented in Python using the Vector Space Model and Rocchio relevance feedback. This implementation achieved a MAP of 0.74349 on the given Chinese news corpus.
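For readers unfamiliar with Rocchio feedback, the update can be sketched as below. This is a minimal illustration and not the code used in this repository: the function name, the sparse-dict vector representation, and the alpha, beta, and gamma weights are all assumptions, since the README does not state how feedback documents are chosen or weighted.

```python
from collections import defaultdict


def rocchio(query, relevant, non_relevant=(), alpha=1.0, beta=0.75, gamma=0.15):
    """Return an updated query vector (hypothetical helper).

    Vectors are sparse dicts mapping term -> weight. The weights
    alpha, beta, gamma are illustrative defaults, not values from the README.
    """
    updated = defaultdict(float)
    # Keep the original query, scaled by alpha.
    for term, weight in query.items():
        updated[term] += alpha * weight
    # Add the centroid of relevant docs, subtract the centroid of non-relevant docs.
    for docs, coef in ((relevant, beta), (non_relevant, -gamma)):
        if not docs:
            continue
        for doc in docs:
            for term, weight in doc.items():
                updated[term] += coef * weight / len(docs)
    # Drop terms whose weight is not positive, a common convention.
    return {term: weight for term, weight in updated.items() if weight > 0}
```

With pseudo-relevance feedback, the top-ranked documents of a first retrieval pass would typically be passed as `relevant`; whether this project does so is not stated here.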

Preprocess

file-list : Each line is the path to one document in the corpus.
vocab.all : The first line gives the encoding of this file; each following line is one term in the collection. The index of a term is its line number minus 1, so in the example below apple has index 1.

utf-8
apple
banana
cat
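
A minimal sketch of reading vocab.all under the rule above. The function name `parse_vocab` is hypothetical, and the sketch assumes the declared encoding is ASCII-compatible (such as utf-8) so that the file can be split into lines before decoding.

```python
def parse_vocab(raw):
    """Parse the bytes of vocab.all (hypothetical helper).

    Returns (encoding, terms), where terms[i] is the term with index i.
    Index 0 holds the encoding line itself, so real terms start at index 1.
    """
    lines = raw.splitlines()
    # The first line names the encoding used by the rest of the file.
    encoding = lines[0].decode("ascii").strip()
    terms = [line.decode(encoding) for line in lines]
    return encoding, terms
```

A typical call would be `parse_vocab(open("vocab.all", "rb").read())`. Keeping the encoding line at position 0 means a list index equals the term index, with no offset arithmetic.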

inverted-file : Uses vocab_id and file_id values that refer to entries in vocab.all and file-list. Each entry starts with a line vocab_id_1 vocab_id_2 N, which denotes a unigram when vocab_id_2 == -1 and a bigram otherwise. N is the number of files that contain the term, and the next N lines give the count of the term in each of those files.
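
The layout described above could be read with a sketch like the following. The function name is hypothetical, and the code assumes each of the N posting lines has the form `file_id count`, which the description implies but does not spell out.

```python
def parse_inverted_file(lines):
    """Map (vocab_id_1, vocab_id_2) -> {file_id: count} (hypothetical helper).

    Assumption: each posting line is "file_id count".
    """
    index = {}
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # tolerate blank lines
        vocab_id_1, vocab_id_2, n_files = (int(x) for x in header.split())
        postings = {}
        # Consume exactly N posting lines that belong to this term.
        for _ in range(n_files):
            file_id, count = (int(x) for x in next(it).split())
            postings[file_id] = count
        index[(vocab_id_1, vocab_id_2)] = postings
    return index
```

A key such as `(1, -1)` is then a unigram and `(1, 2)` a bigram, matching the vocab_id_2 convention.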

IO

The input file is a query.xml whose topics contain the following elements.

<number> : The topic number.
<title> : The topic title.
<question> : A short description of the query topic.
<narrative> : A longer, more detailed description of the topic.
<concepts> : A set of keywords about the topic that can be used in retrieval.
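
A minimal sketch of extracting these fields with the standard library. The function name is hypothetical, and because the name of the element that wraps each topic is not given here, the sketch treats any element with a direct `<number>` child as one topic.

```python
import xml.etree.ElementTree as ET

FIELDS = ("number", "title", "question", "narrative", "concepts")


def parse_queries(xml_data):
    """Return one dict per topic found in xml_data (str or bytes)."""
    root = ET.fromstring(xml_data)
    queries = []
    for elem in root.iter():
        # Assumption: a topic is any element that directly contains <number>.
        if elem.find("number") is None:
            continue
        queries.append({f: (elem.findtext(f) or "").strip() for f in FIELDS})
    return queries
```

Passing the file as bytes, for example `parse_queries(open("query.xml", "rb").read())`, lets the parser honor any encoding declared in the XML header.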

The output file is a result.csv in which every line has the form query_id,retrieved_docs, for example:

011,doc1,doc3,doc5
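
One way to produce such a line, as a small sketch with a hypothetical function name. The query id is kept as a string so that leading zeros such as those in 011 survive.

```python
def format_result_line(query_id, doc_ids):
    """Join a query id and its ranked document ids into one CSV line."""
    # query_id stays a string; converting "011" to int would lose the zeros.
    return ",".join([query_id, *doc_ids])
```

The documents are expected in rank order, best match first.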

Execution

execute.sh [-r] -i query-file -o output-file -m model-dir -d corpus-path

-r
	If specified, turn on relevance feedback.
-i
	Query file in the format described above.
-o
	Output file for the ranked list.
-m
	Directory containing vocab.all, inverted-file, and file-list.
-d
	Absolute path to the corpus directory.
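
If the same options were parsed on the Python side, it might look like the sketch below. This is an assumption for illustration only: how execute.sh actually forwards its arguments is not shown here, and the function name and `dest` names are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser mirroring the execute.sh options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="execute.sh")
    parser.add_argument("-r", dest="feedback", action="store_true",
                        help="turn on relevance feedback")
    parser.add_argument("-i", dest="query_file", required=True,
                        help="query file (query.xml)")
    parser.add_argument("-o", dest="output_file", required=True,
                        help="output file for the ranked list")
    parser.add_argument("-m", dest="model_dir", required=True,
                        help="directory with vocab.all, inverted-file, file-list")
    parser.add_argument("-d", dest="corpus_dir", required=True,
                        help="absolute path to the corpus directory")
    return parser
```

For example, `build_parser().parse_args(["-r", "-i", "query.xml", "-o", "result.csv", "-m", "model", "-d", "/corpus"])` yields a namespace with `feedback` set to True.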
