
The official repository for my bachelor thesis.


TREC LTR Thesis

This repository contains the code for my thesis on Learning to Rank (LTR) with the TREC Deep Learning Track dataset. The project consists of a set of Jupyter notebooks and helper functions used to download, preprocess, index, and rank the TREC dataset using various LTR models in Apache Solr.

Folder Structure

The repository has the following folder structure:

/loggedFeatures: contains the feature vectors generated for each query-document pair during feature engineering, stored as CSV files.

/ltr: contains helper functions used throughout the project, covering data loading and configuration.

/solr: contains the scripts for interacting with Apache Solr.

/submissions: contains the final result files generated by the various LTR models, together with everything used to create those models: the Apache Solr configsets, the training and testing data, and the models in JSON format.
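For readers unfamiliar with how Solr stores LTR models as JSON, here is a minimal sketch of a linear model definition with per-feature min-max normalizers, in the layout the Solr LTR module documents. The model name, store name, feature names, weights, and ranges are all hypothetical and are not taken from the files in /submissions.

```python
import json


def build_linear_model(name, store, weights, ranges):
    """Build a Solr LTR LinearModel definition with min-max normalizers.

    weights: {feature_name: weight}, e.g. learned by a linear ranker
    ranges:  {feature_name: (min, max)} observed in the training data
    """
    features = []
    for feature in weights:
        lo, hi = ranges[feature]
        features.append({
            "name": feature,
            "norm": {
                "class": "org.apache.solr.ltr.norm.MinMaxNormalizer",
                # The Solr reference guide writes these bounds as strings.
                "params": {"min": str(lo), "max": str(hi)},
            },
        })
    return {
        "store": store,
        "name": name,
        "class": "org.apache.solr.ltr.model.LinearModel",
        "features": features,
        "params": {"weights": dict(weights)},
    }


# All names and numbers below are made up for illustration.
model = build_linear_model(
    name="ranksvm_example",
    store="thesis_features",
    weights={"bm25_body": 0.8, "title_match": 0.2},
    ranges={"bm25_body": (0.0, 25.0), "title_match": (0.0, 1.0)},
)
print(json.dumps(model, indent=2))
```

A definition like this is normally uploaded to the collection's model-store endpoint (schema/model-store) with an HTTP PUT. The actual models in this repository may use different classes or parameters.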

Notebooks

The Jupyter notebooks and Python scripts are numbered in the order in which they should be executed. Here is a brief explanation of each file:

1_dataDownloader.ipynb: downloads the TREC dataset and the relevance judgments.

2_preprocessing.py: cleans and preprocesses the TREC text dataset.

3_preprocessingJudgments.py: preprocesses the relevance judgments.

4_indexingDataInSolr.py: creates the index schema and indexes the preprocessed TREC dataset into Solr.

5_addStopWords.py: adds stop words to the Solr configuration to improve search quality.

6_featureEngineering.py: creates the feature store in Apache Solr.

7_featureLogging.py: logs the feature vectors to disk for later use.

8_RankSVM_min_max.ipynb: trains the RankSVM model using min-max normalization.

9_Ranknet_Keras_min_max.ipynb: trains the RankNet model using Keras and min-max normalization.

10_Evaluation_BM25.ipynb: evaluates the performance of the system using BM25 ranking.

11_Evaluation_svm.ipynb: evaluates the RankSVM model.

12_Evaluation_ranknet.ipynb: evaluates the RankNet model.
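Notebooks 8 and 9 both rely on min-max normalization of the logged feature vectors. The following is a minimal sketch of that technique in plain Python, assuming the ranges are fitted on the training data and then reused for the test data. The function names and sample values are illustrative and are not the code used in the notebooks.

```python
def fit_min_max(rows):
    """Return a (min, max) pair per column from equal-length feature vectors."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def apply_min_max(rows, ranges):
    """Scale each feature with (value - min) / (max - min).

    Constant columns (max == min) are mapped to 0.0 to avoid dividing by zero.
    """
    scaled = []
    for row in rows:
        scaled.append([
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, (lo, hi) in zip(row, ranges)
        ])
    return scaled


# Hypothetical feature vectors: three training rows and one test row.
train = [[2.0, 10.0, 1.0], [4.0, 30.0, 1.0], [6.0, 20.0, 1.0]]
test = [[5.0, 40.0, 1.0]]

ranges = fit_min_max(train)                 # fitted on training data only
train_scaled = apply_min_max(train, ranges)
test_scaled = apply_min_max(test, ranges)

print(train_scaled)
print(test_scaled)
```

Training values land in [0, 1] by construction. Test values outside the training range can fall below 0 or above 1, as the second feature of the test row does here.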