Retrieval and Richness when Querying by Document
The paper will appear in DESIRES 2018.
The experiment framework starts from the ingestion of the RCV1-v2 collection and its relevance information. Details of the collection can be found on this site.
Detailed arguments for each Python script are listed by calling python {script}.py --help.
Environment Requirements
- python >= 3.6
- Elasticsearch Server 6.2 (not tested on versions later than 6.2, but should work)
- Other Python packages listed in requirments.txt
Preprocessing Collection
All experiments will make use of the output files from the following three commands.
- Ingestion of the raw collection
python helpers.py ingest {directory to collection} {output filename}
- Parse .qrels file
python helpers.py qrels {directory containing all 3 .qrels files} {output filename} {output file from 1}
- Create query documents by sampling from the relevant documents of each category
python helpers.py sample {output file from 2} {output filename}
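The sampling step can be pictured with the following minimal sketch. It assumes the parsed qrels map each category to a list of relevant document IDs; the function name `sample_query_docs`, the `seed` parameter, and the data layout are hypothetical and are not taken from helpers.py.

```python
import random


def sample_query_docs(relevant_by_category, seed=0):
    """Pick one relevant document per category to serve as its query document.

    `relevant_by_category` maps a category name to a list of document IDs
    (an assumed layout, not necessarily what helpers.py produces).
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    queries = {}
    for category in sorted(relevant_by_category):
        doc_ids = relevant_by_category[category]
        if doc_ids:  # skip categories with no relevant documents
            queries[category] = rng.choice(sorted(doc_ids))
    return queries


if __name__ == "__main__":
    # Toy relevance judgments; the IDs and category codes are made up.
    toy_qrels = {"C15": ["2286", "2287", "2290"], "E11": ["2301"], "GXX": []}
    print(sample_query_docs(toy_qrels))
```

Seeding a private `random.Random` instance, rather than the module-level generator, keeps the sample repeatable across runs without affecting other code that draws random numbers.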
Elasticsearch Experiment
- Index the collection in Elasticsearch
python es_indexing.py {Elasticsearch server}
For detailed arguments, call python es_indexing.py --help
- Run Elasticsearch experiments
python es_exp.py {Elasticsearch server} {index name} {type of experimenting query}
For detailed arguments, call python es_exp.py --help
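One common way to query by document in Elasticsearch is the `more_like_this` query. Whether es_exp.py uses it is not stated here, so the following is only an illustrative sketch; the helper `build_mlt_query`, the field name `text`, and the parameter values are assumptions.

```python
import json


def build_mlt_query(query_text, field="text", max_query_terms=25, size=10):
    """Build a request body for Elasticsearch's `more_like_this` query.

    The field name and parameter values are placeholders for illustration.
    """
    return {
        "size": size,  # number of hits to return
        "query": {
            "more_like_this": {
                "fields": [field],       # fields to compare against
                "like": query_text,      # free text of the query document
                "min_term_freq": 1,      # keep terms that occur only once
                "max_query_terms": max_query_terms,
            }
        },
    }


if __name__ == "__main__":
    body = build_mlt_query("Oil prices rose after the OPEC meeting.")
    print(json.dumps(body, indent=2))
```

The resulting dictionary is plain JSON, so it could be sent to the search endpoint of the index created in the previous step with any HTTP or Elasticsearch client.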
Scikit-learn Experiment
- Create document-term vector file
python vectorize.py {ingested collection} {style: tfidf/bm25}
For detailed arguments, call python vectorize.py {ingested collection} {style} --help
- Dump Elasticsearch index as vector file
python es_dumpvec.py {Elasticsearch server} {index name} {output file}
- Run scikit-learn experiments
python sklearn_exp.py {category} {document vector file} {query style: similarity/onehot/other}
For detailed arguments, call python sklearn_exp.py --help
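To illustrate the idea behind a similarity-style query, the sketch below ranks documents by cosine similarity to a query document over tf-idf vectors. It uses only the standard library rather than scikit-learn, and it assumes whitespace tokenization and a smoothed idf; the function names are hypothetical and the actual weighting in vectorize.py and sklearn_exp.py may differ.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf vector (dict of term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Smoothed idf, so terms found in every document keep some weight.
            term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, tf in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_by_document(vectors, query_index):
    """Rank all other documents by similarity to the query document."""
    query = vectors[query_index]
    scores = [(i, cosine(query, vec))
              for i, vec in enumerate(vectors) if i != query_index]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy collection; document 0 acts as the query document.
    docs = ["oil prices rise", "oil prices fall sharply", "football match tonight"]
    for doc_id, score in rank_by_document(tfidf_vectors(docs), 0):
        print(doc_id, round(score, 3))
```

In the toy collection, the second document shares two terms with the query document and is ranked first, while the third shares none and receives a score of zero.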
Reference
Please cite the following paper:
Eugene Yang, David D. Lewis, Ophir Frieder, David Grossman, and Roman Yurchak. 2018. Retrieval and Richness when Querying by Document. In Proceedings of Design of Experimental Search & Information REtrieval Systems (DESIRES 2018).