This project was built by students of the University of Mannheim for the master's course Information Retrieval (IE 681).
What do you need to run this project?
- numpy
- nltk (stopwords, PorterStemmer, WordNetLemmatizer)
- pandas
- sklearn
- trectools
- trec-car-tool
- xgboost
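
Before running the project, it can help to check that the packages above are importable. The following is a minimal sketch using only the standard library. The import names are assumptions and may differ from the package names listed above (for example, the TREC CAR package is assumed here to import as `trec_car`).

```python
import importlib.util

# Assumed import names for the dependencies listed above; adjust if yours differ.
REQUIRED_MODULES = ["numpy", "nltk", "pandas", "sklearn", "trectools", "trec_car", "xgboost"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Missing modules:", missing_modules())
```

Any name reported as missing needs to be installed (or its import name corrected in the list) before the rest of the code will run.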
For learning to rank (L2R), we used the Java library by Lemur. See:
The main class computes all scores and runs the metrics functions. It contains several variables that act as tunable parameters. If you want to run a different L2R model, change it in the execute_L2R_task function.
All pre-processing steps are implemented in a Preprocess class, which handles lemmatization/stemming, stopword removal, ...
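
To illustrate the kind of pipeline such a pre-processing step performs, here is a small standard-library sketch. It is not the project's Preprocess class: the stopword list is a tiny hand-picked set and `crude_stem` is a toy suffix stripper standing in for a real stemmer such as NLTK's PorterStemmer. All function names are hypothetical.

```python
import re

# Tiny illustrative stopword set; a real run would use a full stopword corpus.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def crude_stem(token):
    """Strip a few common suffixes. A toy stand-in, not Porter's algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, tokenize on alphanumeric runs, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


print(preprocess("The cats are running in the garden"))
```

The order of the steps matters: stopwords are removed before stemming so that the stopword lookup sees the original surface forms.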
This class creates our synthetic HTML Wiki page.
The TF-IDF vector space model (VSM) and the probabilistic BM25 model are each implemented as a separate class.
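
For readers unfamiliar with the two models, the sketch below shows one common formulation of each over pre-tokenized documents. It is not the project's implementation: the function names, the default parameters `k1=1.2` and `b=0.75`, and the smoothed IDF variant are assumptions chosen for illustration, and a non-empty document collection is assumed.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF variant that stays non-negative for very common terms.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1.0) / norm
        scores.append(score)
    return scores


def tfidf_cosine_scores(query, docs):
    """Score each document by cosine similarity of TF-IDF vectors with the query."""
    n_docs = len(docs)
    df = Counter(term for d in docs for term in set(d))

    def weights(tokens):
        # Raw term frequency times log IDF; terms unseen in the collection are dropped.
        return {t: c * math.log(n_docs / df[t]) for t, c in Counter(tokens).items() if df[t] > 0}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        norm_u = math.sqrt(sum(w * w for w in u.values()))
        norm_v = math.sqrt(sum(w * w for w in v.values()))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    q_vec = weights(query)
    return [cosine(q_vec, weights(d)) for d in docs]


docs = [["cat", "sat", "mat"], ["dog", "barked"], ["cat", "cat", "purred"]]
print(bm25_scores(["cat"], docs))
print(tfidf_cosine_scores(["cat"], docs))
```

Both functions return one score per document, in the same order as `docs`, so the results can be used directly as ranking scores or as L2R features.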
The FeatureGenerator class provides functions to generate all the scores for our retrieval models. This includes the two sub-tasks:
- Scores for query-paragraph retrieval (used as features for L2R)
- Scores for paragraph-paragraph retrieval
To cache the GloVe vectors yourself, you will need glove.840B.300d.txt or another set of pretrained vectors. If you use another set, change the target file accordingly. The implementation parses the pretrained GloVe file and extracts the semantic word embedding vectors. See:
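
As a rough illustration of the parsing step, the sketch below reads GloVe's plain-text format, where each line holds a token followed by its vector components separated by spaces. The function name and its `dim` and `vocabulary` parameters are hypothetical and not taken from the project. The token is split off from the right because some tokens may themselves contain spaces.

```python
def load_glove(lines, dim=300, vocabulary=None):
    """Parse GloVe text lines ("token v1 ... v_dim") into a dict of token -> list of floats.

    `lines` can be an open file or any iterable of strings. If `vocabulary` is given,
    only tokens in that set are kept, which keeps the cached result small.
    """
    vectors = {}
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        if len(parts) <= dim:
            continue  # skip empty or malformed lines
        # The last `dim` fields are the vector; everything before them is the token.
        token = " ".join(parts[:-dim])
        if vocabulary is not None and token not in vocabulary:
            continue
        vectors[token] = [float(x) for x in parts[-dim:]]
    return vectors


# Usage with a real file (path is an example):
#   with open("glove.840B.300d.txt", encoding="utf-8") as f:
#       vectors = load_glove(f, dim=300, vocabulary={"retrieval", "query"})
sample = ["cat 0.1 0.2 0.3\n", ". . 0.4 0.5 0.6\n"]
print(load_glove(sample, dim=3))
```

Restricting the load to the collection's vocabulary is one way to make caching practical, since the full file is several gigabytes.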
Helper functions take our retrieval models' scores and convert them into a file format that L2R can read as features.
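
A common feature file layout for L2R tools is the LETOR/SVMrank-style text format, with one line per query-document pair: `<label> qid:<qid> 1:<f1> 2:<f2> ... # <doc id>`. The sketch below assumes that layout. The function names and the six-decimal formatting are illustrative choices, not the project's code.

```python
def to_letor_line(label, qid, features, doc_id=None):
    """Format one query-document pair as "<label> qid:<qid> 1:<f1> 2:<f2> ... # <doc_id>"."""
    parts = [str(label), f"qid:{qid}"]
    # Feature ids are 1-based and listed in increasing order.
    parts += [f"{i}:{value:.6f}" for i, value in enumerate(features, start=1)]
    line = " ".join(parts)
    return f"{line} # {doc_id}" if doc_id is not None else line


def write_letor_file(path, rows):
    """Write rows of (label, qid, features, doc_id); rows should be grouped by qid."""
    with open(path, "w", encoding="utf-8") as out:
        for label, qid, features, doc_id in rows:
            out.write(to_letor_line(label, qid, features, doc_id) + "\n")


# Example: relevance label 1, query 3, two features (e.g. a TF-IDF and a BM25 score).
print(to_letor_line(1, 3, [0.5, 2.0], "p7"))
```

Each retrieval model contributes one feature column, so the order of scores in `features` has to stay the same across all rows and across training and test files.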
For our data structures, we implemented scorers for Accuracy, Precision, Recall, and F1-Score in the Standard class. The metrics class contains scorers for Mean Average Precision (MAP) and Mean Reciprocal Rank (MRR).
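
The following sketch shows how several of these metrics are commonly computed from a ranked list and a set of relevant items. It is independent of the project's Standard and metrics classes. All function names are hypothetical, and rankings are assumed to contain no duplicate items.

```python
def precision_recall_f1(retrieved, relevant):
    """Set-based precision, recall, and F1 for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


def average_precision(ranking, relevant):
    """Mean of precision@k taken at each rank k where a relevant item appears."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    relevant = set(relevant)
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_over_queries(metric, runs):
    """Average a per-query metric over runs, a list of (ranking, relevant) pairs."""
    return sum(metric(r, rel) for r, rel in runs) / len(runs) if runs else 0.0


runs = [(["d1", "d2", "d3", "d4"], {"d2", "d4"}), (["d1"], {"d1"})]
print("MAP:", mean_over_queries(average_precision, runs))
print("MRR:", mean_over_queries(reciprocal_rank, runs))
```

MAP and MRR are both obtained by averaging a per-query value, which is why a single `mean_over_queries` helper serves both.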