PyTerrier

A Python API for Terrier

Installation

Linux

Make sure that the JAVA_HOME environment variable is set to the location of your Java installation:
pip install python-terrier
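
If the install succeeds but Terrier later fails to start, JAVA_HOME is a likely culprit. The snippet below is a minimal sketch for checking it from Python before importing PyTerrier. The helper name check_java_home is hypothetical and is not part of PyTerrier.

```python
import os

def check_java_home(env=None):
    """Return (ok, message) describing whether JAVA_HOME looks usable.

    Hypothetical helper for illustration; it only checks that the
    variable is set and points to an existing directory.
    """
    env = os.environ if env is None else env
    java_home = env.get("JAVA_HOME")
    if not java_home:
        return False, "JAVA_HOME is not set"
    if not os.path.isdir(java_home):
        return False, f"JAVA_HOME points to a missing directory: {java_home}"
    return True, f"JAVA_HOME is set to {java_home}"

ok, message = check_java_home()
print(message)
```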

macOS

You need to have Java installed. Pyjnius/PyTerrier will pick up the location automatically.
pip install python-terrier

Windows

PyTerrier is not available for Windows because pytrec_eval isn't available for Windows. If you can compile and install pytrec_eval yourself, it should work fine.

Colab notebooks

import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
!pip install python-terrier

Indexing

Indexing TREC formatted collections

You can create an index from a TREC-formatted collection using TRECCollectionIndexer.
For TXT, PDF, Microsoft Word and similar files, you can use FilesIndexer.
For a Pandas DataFrame, you can use DFIndexer.
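
For readers unfamiliar with the TREC document format, the sketch below shows what such a collection typically looks like and how its documents could be pulled apart in plain Python. It is illustrative only and is not how TRECCollectionIndexer works internally. The DOCNO and TEXT tag names follow common TREC conventions, but real collections vary, and the parse_trec_docs helper and its output keys are hypothetical.

```python
import re

# Typical TREC markup: each document sits in <DOC>...</DOC> with a <DOCNO> id.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>", re.DOTALL)
TEXT_RE = re.compile(r"<TEXT>\s*(.*?)\s*</TEXT>", re.DOTALL)

def parse_trec_docs(raw):
    """Split a TREC-formatted string into a list of {docno, text} dicts."""
    docs = []
    for body in DOC_RE.findall(raw):
        docno = DOCNO_RE.search(body)
        text = TEXT_RE.search(body)
        if docno is None:
            continue  # skip documents without an identifier
        docs.append({
            "docno": docno.group(1),
            "text": text.group(1) if text else "",
        })
    return docs

sample = """<DOC>
<DOCNO> doc1 </DOCNO>
<TEXT>
Terrier is an information retrieval platform.
</TEXT>
</DOC>
<DOC>
<DOCNO> doc2 </DOCNO>
<TEXT>
PyTerrier exposes Terrier to Python.
</TEXT>
</DOC>
"""

docs = parse_trec_docs(sample)
for doc in docs:
    print(doc["docno"], "->", doc["text"])
```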

See examples at:
https://colab.research.google.com/drive/17WpzhtlMj1U2UJku-RaO2axNsUFhPI6z

Retrieval and Evaluation

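import pyterrier as pt
# topicsFile and qrelsFile are paths to a TREC topics file and a TREC qrels file
topics = pt.Utils.parse_trec_topics_file(topicsFile)
qrels = pt.Utils.parse_qrels(qrelsFile)
# retrieve with BM25 for every topic, then compute mean average precision
BM25_br = pt.BatchRetrieve(index, "BM25")
res = BM25_br.transform(topics)
pt.Utils.evaluate(res, qrels, metrics=['map'])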
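
To make the 'map' metric concrete, here is a minimal sketch of mean average precision over plain dictionaries. It is not PyTerrier's implementation, and details such as which queries are averaged over may differ from pytrec_eval. The function names and the dictionary layout are assumptions made for this example.

```python
def average_precision(ranked_docnos, relevant):
    """Average precision of one ranked list against a set of relevant docnos."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, docno in enumerate(ranked_docnos, start=1):
        if docno in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / len(relevant)

def mean_average_precision(results, qrels):
    """results: {qid: [docno, ...] in rank order}; qrels: {qid: {docno: label}}."""
    scores = []
    for qid, judged in qrels.items():
        relevant = {docno for docno, label in judged.items() if label > 0}
        scores.append(average_precision(results.get(qid, []), relevant))
    return sum(scores) / len(scores) if scores else 0.0

# Toy data, invented for illustration
results = {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d5"]}
qrels = {"q1": {"d1": 1, "d3": 1}, "q2": {"d5": 1}}

map_score = mean_average_precision(results, qrels)
print(round(map_score, 4))
```

In this toy data, q1 has relevant documents at ranks 1 and 3, giving an average precision of (1/1 + 2/3) / 2, and q2 has its only relevant document at rank 2, giving 1/2.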

See examples at: https://colab.research.google.com/drive/1yime_0D21Q-KzFD4IbsRzTvjRbo9vz4I

Experiment - Perform Retrieval and Evaluation with a single function

We provide an Experiment function, which lets you compare multiple retrieval approaches on the same queries and relevance assessments:

pt.Experiment(topics, [BM25_br, PL2_br], eval_metrics, qrels)
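
Conceptually, such a comparison runs every system over the same topics and scores each run against the same qrels. The sketch below illustrates that idea in plain Python with precision at k. It is not how pt.Experiment is implemented, and its output format is invented. The compare_systems helper, the toy systems and the data are all hypothetical.

```python
def precision_at_k(ranked, relevant, k=2):
    """Fraction of the top-k ranked docnos that are relevant."""
    return sum(1 for docno in ranked[:k] if docno in relevant) / k

def compare_systems(topics, systems, qrels, k=2):
    """Evaluate each system on the same topics and qrels; one row per system."""
    rows = []
    for name, retrieve in systems.items():
        per_query = [
            precision_at_k(retrieve(query), qrels.get(qid, set()), k)
            for qid, query in topics.items()
        ]
        rows.append({"name": name, f"P@{k}": sum(per_query) / len(per_query)})
    return rows

# Toy topics, qrels and systems, invented for illustration
topics = {"q1": "terrier", "q2": "python"}
qrels = {"q1": {"d1", "d3"}, "q2": {"d2", "d3"}}

rankings_a = {"terrier": ["d1", "d3", "d2"], "python": ["d2", "d3", "d1"]}
systems = {
    "A": lambda query: rankings_a[query],       # query-dependent ranking
    "B": lambda query: ["d2", "d1", "d3"],      # same ranking for every query
}

rows = compare_systems(topics, systems, qrels)
for row in rows:
    print(row)
```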

More examples are provided at: https://colab.research.google.com/drive/15oG7HwyYCBFuborjmfYglea0VLkUjyK-

Learning to Rank

First create a FeaturesBatchRetrieve(index, features) object with the desired features.

Call transform(topics_set) on the train, validation and test topic sets to get dataframes containing the feature scores, and use these to train your chosen model.

Use your trained model to predict scores for the test topics and evaluate the result with pt.Utils.evaluate().

BM25_with_features_br = pt.FeaturesBatchRetrieve(index, ["WMODEL:BM25F", "WMODEL:PL2F"], controls={"wmodel" : "BM25"})
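
The train-then-predict workflow described above can be sketched without any libraries. In the example below, a small linear model trained by stochastic gradient descent stands in for "your chosen model", and hand-written feature vectors stand in for the feature scores that transform() would return. All names and numbers are made up for illustration, and none of this is PyTerrier code.

```python
def dot(weights, features):
    return sum(w * x for w, x in zip(weights, features))

def train_linear(rows, lr=0.1, epochs=200):
    """Pointwise learning to rank: fit weights so dot(w, features) ~ label."""
    weights = [0.0] * len(rows[0][0])
    for _ in range(epochs):
        for features, label in rows:
            error = dot(weights, features) - label
            # gradient step on the squared error of this single example
            weights = [w - lr * error * x for w, x in zip(weights, features)]
    return weights

def rerank(candidates, weights):
    """Score (docno, features) pairs and sort by descending predicted score."""
    scored = [(docno, dot(weights, features)) for docno, features in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Training rows: ([feature_1, feature_2], relevance label), invented values
train_rows = [
    ([1.0, 0.0], 1),
    ([0.0, 1.0], 0),
    ([0.9, 0.2], 1),
    ([0.1, 0.8], 0),
]
weights = train_linear(train_rows)

# Candidate documents for one test topic, with their feature vectors
test_candidates = [("d7", [0.2, 0.9]), ("d8", [0.8, 0.1])]
reranked = rerank(test_candidates, weights)
print([docno for docno, _ in reranked])
```

In practice you would replace train_linear with a real learner and evaluate the re-ranked results against the qrels.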

LTR_pipeline

Create an LTR_pipeline object with the following arguments:

Index reference or path to the index on disk
Weighting model name
Features list
Qrels
LTR model

Call the fit() method on the created object with the training topics.

Evaluate the results with the Experiment function, using the test topics.

pt.LTR_pipeline(index, model, features, qrels, LTR)

More learning to rank examples are provided at: https://colab.research.google.com/drive/1KwHoahx_i0vax9fnCZpLP-JmI9jvSoey

Credits

Alex Tsolov, University of Glasgow
Craig Macdonald, University of Glasgow
