Pyserini

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations. Retrieval using sparse representations is provided via integration with our group's Anserini IR toolkit, which is built on Lucene. Retrieval using dense representations is provided via integration with Facebook's Faiss library.

Pyserini is primarily designed to provide effective, reproducible, and easy-to-use first-stage retrieval in a multi-stage ranking architecture. Our toolkit is self-contained as a standard Python package and comes with queries, relevance judgments, pre-built indexes, and evaluation scripts for many commonly used IR test collections. With Pyserini, it's easy to reproduce runs on a number of standard IR test collections!

For additional details, our paper in SIGIR 2021 provides a nice overview.

⁉️ Important Note: Lucene 8 to Lucene 9 Transition

In 2022, Pyserini underwent a transition from Lucene 8 to Lucene 9. Most of the pre-built indexes have been rebuilt using Lucene 9, but there are a few still based on Lucene 8.

More details:

PyPI v0.17.1 (commit 33c87c, released 2022/08/13) is the last Pyserini release built on Lucene 8, based on Anserini v0.14.4. Thereafter, Anserini trunk was upgraded to Lucene 9.
PyPI v0.18.0 (commit 5fab14, released 2022/09/26) is built on Anserini v0.15.0, using Lucene 9. Thereafter, Pyserini trunk advanced to Lucene 9.

What's the impact? Indexes built with Lucene 8 are not fully compatible with Lucene 9 code (see Anserini #1952). The workaround is to disable consistent tie-breaking, which happens automatically if a Lucene 8 index is detected by Pyserini. However, Lucene 9 code running on Lucene 8 indexes will give slightly different results than Lucene 8 code running on Lucene 8 indexes. Note that Lucene 8 code is not able to read indexes built with Lucene 9.
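The compatibility rules above can be written down as a small lookup. This is a minimal sketch and not part of Pyserini: the helper names (`lucene_major`, `can_read`, `uses_consistent_tie_breaking`) are hypothetical, and treating every release up to v0.17.1 as Lucene 8 is a simplifying assumption (the note only states that v0.17.1 is the last Lucene 8 release).

```python
def parse_version(version):
    """Turn a version string such as 'v0.18.0' into a tuple of ints."""
    text = version[1:] if version.startswith("v") else version
    return tuple(int(part) for part in text.split("."))


def lucene_major(pyserini_version):
    """Lucene generation of a Pyserini release (assumption: <= 0.17.1 means Lucene 8)."""
    return 8 if parse_version(pyserini_version) <= (0, 17, 1) else 9


def can_read(code_lucene, index_lucene):
    """Lucene 9 code reads Lucene 8 and 9 indexes; Lucene 8 code cannot read Lucene 9 indexes."""
    return code_lucene >= index_lucene


def uses_consistent_tie_breaking(code_lucene, index_lucene):
    """Consistent tie-breaking is disabled when Lucene 9 code reads a Lucene 8 index."""
    return not (code_lucene == 9 and index_lucene == 8)
```

For example, `can_read(lucene_major("0.17.1"), 9)` is false, which matches the statement that Lucene 8 code cannot read indexes built with Lucene 9.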

Why is this necessary? Although disruptive, an upgrade to Lucene 9 is necessary to take advantage of Lucene's HNSW indexes, which will increase the capabilities of Pyserini and open up the design space of dense/sparse hybrids.

🎬 Installation

Install via PyPI (requires Python 3.8+):

pip install pyserini

Sparse retrieval depends on Anserini, which is itself built on Lucene, and thus requires Java 11.
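If you are unsure which Java version is on your path, a quick check along the following lines may help. This is only a sketch using the Python standard library; the function names are made up for illustration, and the regular expression assumes the usual `java -version` banner format (for example `openjdk version "11.0.19"` or the legacy `java version "1.8.0_371"`).

```python
import re
import subprocess


def parse_java_major(version_output):
    """Extract the major Java version from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("could not find a Java version string")
    major = int(match.group(1))
    # Legacy scheme: Java 8 and earlier report themselves as 1.8, 1.7, ...
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def installed_java_major():
    """Run `java -version` (which prints to stderr) and return the major version."""
    result = subprocess.run(
        ["java", "-version"], capture_output=True, text=True, check=True
    )
    return parse_java_major(result.stderr or result.stdout)
```

Calling `installed_java_major()` raises `FileNotFoundError` when no `java` executable is found, which is itself a useful signal that Java still needs to be installed.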

Dense retrieval depends on neural networks and requires a more complex set of dependencies. A pip installation will automatically pull in the 🤗 Transformers library to satisfy the package requirements. Pyserini also depends on PyTorch and Faiss, but since these packages may require platform-specific custom configuration, they are not explicitly listed in the package requirements. We leave the installation of these packages to you.

The software ecosystem is rapidly evolving and a potential source of frustration is incompatibility among different versions of underlying dependencies. We provide additional detailed installation instructions here.

If you're planning on just using Pyserini, then the pip instructions above are fine. However, if you're planning on contributing to the codebase or want to work with the latest not-yet-released features, you'll need a development installation. Instructions are provided here.

🙋 How do I search?

Pyserini supports the following classes of retrieval models:

Traditional lexical models (e.g., BM25) using LuceneSearcher.
Learned sparse retrieval models (e.g., uniCOIL, SPLADE, etc.) using LuceneImpactSearcher.
Learned dense retrieval models (e.g., DPR, Contriever, etc.) using FaissSearcher.
Hybrid retrieval models (e.g., dense-sparse fusion) using HybridSearcher.

See this guide (same as the links above) for details on how to search common corpora in IR and NLP research (e.g., MS MARCO, NaturalQuestions, BEIR, etc.) using indexes that we have already built for you.
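As a rough illustration of searching a pre-built index with `LuceneSearcher`, here is a minimal sketch. The index name `msmarco-v1-passage` is an assumption about what your installed version offers (check the guide to pre-built indexes), and `search_prebuilt` and `format_hits` are hypothetical helper names, not part of Pyserini. The import is deferred into the function because loading Pyserini requires a working Java 11 setup, and the first call downloads the index.

```python
def search_prebuilt(query, index_name="msmarco-v1-passage", k=10):
    """Search a pre-built Lucene index with BM25 and return the top-k hits."""
    # Deferred import: needs Java 11 and triggers an index download on first use.
    from pyserini.search.lucene import LuceneSearcher

    searcher = LuceneSearcher.from_prebuilt_index(index_name)
    return searcher.search(query, k=k)


def format_hits(hits):
    """Render hits (objects with .docid and .score) as 'rank docid score' lines."""
    return [
        f"{rank:2} {hit.docid:15} {hit.score:.5f}"
        for rank, hit in enumerate(hits, start=1)
    ]
```

A typical use would be `print("\n".join(format_hits(search_prebuilt("what is a lucene index"))))`.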

Once you get the top-k results, you'll actually want to fetch the document text... See this guide for how.
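In outline, fetching the text means looking up the stored document by its docid and reading its raw content. The sketch below assumes a searcher object exposing `doc(docid)` whose result has a `raw()` method, as `LuceneSearcher` does, and that the raw content is often (but not always) a JSON record with a `contents` field. The helper names are hypothetical.

```python
import json


def extract_contents(raw):
    """Return the 'contents' field of a JSON document, or the raw string otherwise."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return raw
    if isinstance(record, dict) and "contents" in record:
        return record["contents"]
    return raw


def fetch_text(searcher, docid):
    """Look up a document by docid; returns None if the index has no such document."""
    doc = searcher.doc(docid)
    return None if doc is None else extract_contents(doc.raw())
```

Note that raw text is only available if the index was built with the raw documents stored.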

🙋 How do I index my own corpus?

It depends on the type of retrieval model you want to search with.

The steps are different for different classes of models: this guide (same as the links above) describes the details.
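For the traditional lexical case, one common starting point is to convert your documents into JSON Lines records with `id` and `contents` fields and then run the Lucene indexer over that directory. The sketch below shows that preparation step only. The helper names are hypothetical, and the indexing command is reproduced from memory of the Pyserini documentation, so treat the flags as assumptions and confirm them against the guide for your installed version.

```python
import json
from pathlib import Path


def write_jsonl_corpus(docs, out_dir, filename="docs.jsonl"):
    """Write {docid: text} as JSON Lines records with 'id' and 'contents' fields."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / filename
    with path.open("w", encoding="utf-8") as f:
        for docid, text in docs.items():
            record = {"id": str(docid), "contents": text}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


def lucene_index_command(input_dir, index_dir, threads=1):
    """Build the indexing command line (flags are an assumption; verify in the guide)."""
    return [
        "python", "-m", "pyserini.index.lucene",
        "--collection", "JsonCollection",
        "--input", str(input_dir),
        "--index", str(index_dir),
        "--generator", "DefaultLuceneDocumentGenerator",
        "--threads", str(threads),
        "--storePositions", "--storeDocvectors", "--storeRaw",
    ]
```

The returned list can be passed to `subprocess.run(..., check=True)` once the corpus directory has been written.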

🙋 Additional FAQs

How do I configure search? (Guide to Interactive Search)
How do I manually download indexes? (Guide to Interactive Search)
How do I perform dense and hybrid retrieval? (Guide to Interactive Search)
How do I iterate over index terms and access term statistics? (Index Reader API)
How do I traverse postings? (Index Reader API)
How do I access and manipulate term vectors? (Index Reader API)
How do I compute the tf-idf or BM25 score of a document? (Index Reader API)
How do I access basic index statistics? (Index Reader API)
How do I access underlying Lucene analyzers? (Analyzer API)
How do I build custom Lucene queries? (Query Builder API)
How do I iterate over raw collections? (Collection API)

⚗️ Reproducibility

With Pyserini, it's easy to reproduce runs on a number of standard IR test collections! We provide a number of pre-built indexes that directly support reproducibility "out of the box".

In our SIGIR 2022 paper, we introduced "two-click reproductions" that allow anyone to reproduce experimental runs with only two clicks (i.e., copy and paste). Documentation is organized into reproduction matrices for different corpora that provide a summary of different experimental conditions and query sets.

For more details, see our paper on Building a Culture of Reproducibility in Academic Research.

Programmatic execution of the reproductions

To run the MS MARCO reproductions programmatically, see instructions on each individual page above. For all the others:

python scripts/repro_matrix/run_all_beir.py
python scripts/repro_matrix/run_all_mrtydi.py
python scripts/repro_matrix/run_all_miracl.py
python scripts/repro_matrix/run_all_odqa.py --topics nq
python scripts/repro_matrix/run_all_odqa.py --topics tqa

And to generate the nicely formatted documentation pages:

python scripts/repro_matrix/generate_html_beir.py > docs/2cr/beir.html
python scripts/repro_matrix/generate_html_mrtydi.py > docs/2cr/mrtydi.html
python scripts/repro_matrix/generate_html_miracl.py > docs/2cr/miracl.html
python scripts/repro_matrix/generate_html_odqa.py > docs/2cr/odqa.html
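If you prefer to drive these commands from a single Python script, something like the following sketch would do it. The script paths and arguments are copied from the commands above; the function names are hypothetical, and the sketch assumes it is run from the repository root.

```python
import subprocess
import sys

# Script paths and arguments copied from the commands listed above.
RUN_SCRIPTS = [
    ["scripts/repro_matrix/run_all_beir.py"],
    ["scripts/repro_matrix/run_all_mrtydi.py"],
    ["scripts/repro_matrix/run_all_miracl.py"],
    ["scripts/repro_matrix/run_all_odqa.py", "--topics", "nq"],
    ["scripts/repro_matrix/run_all_odqa.py", "--topics", "tqa"],
]
HTML_PAGES = ["beir", "mrtydi", "miracl", "odqa"]


def run_commands(python=sys.executable):
    """Command lines for the reproduction runs."""
    return [[python, *args] for args in RUN_SCRIPTS]


def generate_commands(python=sys.executable):
    """(command, output path) pairs for the HTML documentation pages."""
    return [
        ([python, f"scripts/repro_matrix/generate_html_{name}.py"], f"docs/2cr/{name}.html")
        for name in HTML_PAGES
    ]


def run_everything():
    """Run all reproductions, then regenerate the pages; stops at the first failure."""
    for command in run_commands():
        subprocess.run(command, check=True)
    for command, output_path in generate_commands():
        with open(output_path, "w", encoding="utf-8") as out:
            subprocess.run(command, stdout=out, check=True)
```

Using `check=True` means a failing reproduction aborts the whole sequence instead of silently producing incomplete pages.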

Additional reproduction guides below provide detailed step-by-step instructions.

Available Corpora

Corpora	Size	Checksum
MS MARCO V1 passage: uniCOIL (noexp)	2.7 GB	`f17ddd8c7c00ff121c3c3b147d2e17d8`
MS MARCO V1 passage: uniCOIL (d2q-T5)	3.4 GB	`78eef752c78c8691f7d61600ceed306f`
MS MARCO V1 doc: uniCOIL (noexp)	11 GB	`11b226e1cacd9c8ae0a660fd14cdd710`
MS MARCO V1 doc: uniCOIL (d2q-T5)	19 GB	`6a00e2c0c375cb1e52c83ae5ac377ebb`
MS MARCO V2 passage: uniCOIL (noexp)	24 GB	`d9cc1ed3049746e68a2c91bf90e5212d`
MS MARCO V2 passage: uniCOIL (d2q-T5)	41 GB	`1949a00bfd5e1f1a230a04bbc1f01539`
MS MARCO V2 doc: uniCOIL (noexp)	55 GB	`97ba262c497164de1054f357caea0c63`
MS MARCO V2 doc: uniCOIL (d2q-T5)	72 GB	`c5639748c2cbad0152e10b0ebde3b804`
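After downloading one of these corpora, it is worth comparing its checksum against the table. The sketch below assumes the checksums are MD5 digests (they are 32 hexadecimal characters long, which is consistent with MD5, but the table does not say so explicitly). The function names and the file name in the usage note are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected):
    """Compare a file's MD5 digest with the expected value, ignoring case and whitespace."""
    return md5_of_file(path) == expected.strip().lower()
```

For instance, `matches_checksum("corpus.tar", "f17ddd8c7c00ff121c3c3b147d2e17d8")` would check a downloaded archive against the first row, where `corpus.tar` stands in for whatever name you saved the file under.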

📃 Additional Documentation

Baselines for KILT: a benchmark for Knowledge Intensive Language Tasks
Baselines for TripClick: a large-scale dataset of click logs in the health domain
Baselines (in Anserini) for the FEVER (Fact Extraction and VERification) dataset
Guide to pre-built indexes
Guide to interactive searching
Guide to text classification with the 20Newsgroups dataset
Guide to working with the COVID-19 Open Research Dataset (CORD-19)
Guide to working with entity linking
Guide to working with spaCy
Usage of the Analyzer API
Usage of the Index Reader API
Usage of the Query Builder API
Usage of the Collection API
Direct Interaction via Pyjnius

ℹ️ Release History

v0.21.0 (w/ Anserini v0.21.0): April 6, 2023 [Release Notes]
v0.20.0 (w/ Anserini v0.20.0): February 1, 2023 [Release Notes]
v0.19.2 (w/ Anserini v0.16.2): December 16, 2022 [Release Notes]
v0.19.1 (w/ Anserini v0.16.1): November 12, 2022 [Release Notes]
v0.19.0 (w/ Anserini v0.16.1): November 2, 2022 [Release Notes] [Known Issues]
v0.18.0 (w/ Anserini v0.15.0): September 26, 2022 [Release Notes] (First release based on Lucene 9)
v0.17.1 (w/ Anserini v0.14.4): August 13, 2022 [Release Notes] (Final release based on Lucene 8)
v0.17.0 (w/ Anserini v0.14.3): May 28, 2022 [Release Notes]
v0.16.1 (w/ Anserini v0.14.3): May 12, 2022 [Release Notes]
v0.16.0 (w/ Anserini v0.14.1): March 1, 2022 [Release Notes]
v0.15.0 (w/ Anserini v0.14.0): January 21, 2022 [Release Notes]
v0.14.0 (w/ Anserini v0.13.5): November 8, 2021 [Release Notes]
v0.13.0 (w/ Anserini v0.13.1): July 3, 2021 [Release Notes]
v0.12.0 (w/ Anserini v0.12.0): May 5, 2021 [Release Notes]
v0.11.0.0: February 18, 2021 [Release Notes]
v0.10.1.0: January 8, 2021 [Release Notes]
v0.10.0.1: December 2, 2020 [Release Notes]
v0.10.0.0: November 26, 2020 [Release Notes]
v0.9.4.0: June 26, 2020 [Release Notes]
v0.9.3.1: June 11, 2020 [Release Notes]
v0.9.3.0: May 27, 2020 [Release Notes]
v0.9.2.0: May 15, 2020 [Release Notes]
v0.9.1.0: May 6, 2020 [Release Notes]
v0.9.0.0: April 18, 2020 [Release Notes]
v0.8.1.0: March 22, 2020 [Release Notes]
v0.8.0.0: March 12, 2020 [Release Notes]
v0.7.2.0: January 25, 2020 [Release Notes]
v0.7.1.0: January 9, 2020 [Release Notes]
v0.7.0.0: December 13, 2019 [Release Notes]
v0.6.0.0: November 2, 2019

Additional technical notes

With v0.11.0.0 and before, Pyserini versions adopted the convention of X.Y.Z.W, where X.Y.Z tracks the version of Anserini, and W is used to distinguish different releases on the Python end. Starting with Anserini v0.12.0, Anserini and Pyserini versions have become decoupled.
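To make the older X.Y.Z.W convention concrete, the following sketch splits a legacy Pyserini version into the Anserini version it tracks and the Python-side release counter. The function name is hypothetical, and it only applies to four-part versions up to v0.11.0.0.

```python
def split_legacy_version(version):
    """Split 'X.Y.Z.W' into (Anserini version 'X.Y.Z', Python-side release counter W)."""
    text = version[1:] if version.startswith("v") else version
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError(f"not a legacy X.Y.Z.W version: {version!r}")
    return ".".join(parts[:3]), int(parts[3])
```

Under this reading, v0.10.0.1 is the second Python-side release tracking Anserini 0.10.0, following v0.10.0.0.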

Anserini is designed to work with JDK 11. There was a JRE path change above JDK 9 that breaks pyjnius 1.2.0, as documented in this issue, also reported in Anserini here and here. This issue was fixed with pyjnius 1.2.1 (released December 2019). The previous error was documented in this notebook and this notebook documents the fix.

✨ References

If you use Pyserini, please cite the following paper:

@INPROCEEDINGS{Lin_etal_SIGIR2021_Pyserini,
   author = "Jimmy Lin and Xueguang Ma and Sheng-Chieh Lin and Jheng-Hong Yang and Ronak Pradeep and Rodrigo Nogueira",
   title = "{Pyserini}: A {Python} Toolkit for Reproducible Information Retrieval Research with Sparse and Dense Representations",
   booktitle = "Proceedings of the 44th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2021)",
   year = 2021,
   pages = "2356--2362",
}

🙏 Acknowledgments

This research is supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada.
