Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations. Retrieval using sparse representations is provided via integration with our group's Anserini IR toolkit, which is built on Lucene. Retrieval using dense representations is provided via integration with Facebook's Faiss library.
Pyserini is primarily designed to provide effective, reproducible, and easy-to-use first-stage retrieval in a multi-stage ranking architecture. Our toolkit is self-contained as a standard Python package and comes with queries, relevance judgments, pre-built indexes, and evaluation scripts for many commonly used IR test collections.
With Pyserini, it's easy to reproduce runs on a number of standard IR test collections! A low-effort way to try things out is to look at our online notebooks, which will allow you to get started with just a few clicks.
Install via PyPI (requires Python 3.6+):
pip install pyserini==0.12.0
Sparse retrieval depends on Anserini, which is itself built on Lucene, and thus Java 11.
Dense retrieval depends on neural networks and requires a more complex set of dependencies.
A pip installation will automatically pull in the 🤗 Transformers library to satisfy the package requirements.
Pyserini also depends on PyTorch and Faiss, but since these packages may require platform-specific custom configuration, they are not explicitly listed in the package requirements.
We leave the installation of these packages to you.
The software ecosystem is rapidly evolving and a potential source of frustration is incompatibility among different versions of underlying dependencies. We provide additional detailed installation instructions here.
If you're planning on just using Pyserini, then the pip instructions above are fine.
However, if you're planning on contributing to the codebase or want to work with the latest not-yet-released features, you'll need a development installation.
For this, clone our repo with the --recurse-submodules option to make sure the tools/ submodule also gets cloned.
The tools/ directory, which contains evaluation tools and scripts, is actually this repo, integrated as a Git submodule (so that it can be shared across related projects).
Build as follows (you might get warnings, but okay to ignore):
cd tools/eval && tar xvfz trec_eval.9.0.4.tar.gz && cd trec_eval.9.0.4 && make && cd ../../..
cd tools/eval/ndeval && make && cd ../../..
Next, you'll need to clone and build Anserini.
It makes sense to put both pyserini/ and anserini/ in a common folder.
After you've successfully built Anserini, copy the fatjar, which will be target/anserini-X.Y.Z-SNAPSHOT-fatjar.jar, into pyserini/resources/jars/.
As with the pip installation, a potential source of frustration is incompatibility among different versions of underlying dependencies.
For these and other issues, we provide additional detailed installation instructions here.
You can confirm everything is working by running the unit tests:
python -m unittest
Assuming all tests pass, you should be ready to go!
- How do I search?
- How do I fetch a document?
- How do I index and search my own documents?
- How do I reproduce results on Robust04, MS MARCO...?
- How do I configure search? (Guide to Interactive Search)
- How do I manually download indexes? (Guide to Interactive Search)
- How do I perform dense and hybrid retrieval? (Guide to Interactive Search)
- How do I iterate over index terms and access term statistics? (Index Reader API)
- How do I traverse postings? (Index Reader API)
- How do I access and manipulate term vectors? (Index Reader API)
- How do I compute the tf-idf or BM25 score of a document? (Index Reader API)
- How do I access basic index statistics? (Index Reader API)
- How do I access underlying Lucene analyzers? (Analyzer API)
- How do I build custom Lucene queries? (Query Builder API)
- How do I iterate over raw collections? (Collection API)
Pyserini supports sparse retrieval (e.g., BM25 ranking using bag-of-words representations), dense retrieval (e.g., nearest-neighbor search on transformer-encoded representations), as well as hybrid retrieval that integrates both approaches.
The SimpleSearcher class provides the entry point for sparse retrieval using bag-of-words representations.
Anserini supports a number of pre-built indexes for common collections that it'll automatically download for you and store in ~/.cache/pyserini/indexes/.
Here's how to use a pre-built index for the MS MARCO passage ranking task and issue a query interactively:
from pyserini.search import SimpleSearcher
searcher = SimpleSearcher.from_prebuilt_index('msmarco-passage')
hits = searcher.search('what is a lobster roll?')
for i in range(0, 10):
    print(f'{i+1:2} {hits[i].docid:7} {hits[i].score:.5f}')
The results should be as follows:
1 7157707 11.00830
2 6034357 10.94310
3 5837606 10.81740
4 7157715 10.59820
5 6034350 10.48360
6 2900045 10.31190
7 7157713 10.12300
8 1584344 10.05290
9 533614 9.96350
10 6234461 9.92200
To further examine the results:
# Grab the raw text:
hits[0].raw
# Grab the raw Lucene Document:
hits[0].lucene_document
Pre-built Anserini indexes are hosted at the University of Waterloo's GitLab and mirrored on Dropbox. The following method will list available pre-built indexes:
SimpleSearcher.list_prebuilt_indexes()
A description of what's available can be found here. Alternatively, see this answer for how to download an index manually.
The SimpleDenseSearcher class provides the entry point for dense retrieval, and its usage is quite similar to SimpleSearcher.
The only additional thing we need to specify for dense retrieval is the query encoder.
from pyserini.dsearch import SimpleDenseSearcher, TctColBertQueryEncoder
encoder = TctColBertQueryEncoder('castorini/tct_colbert-msmarco')
searcher = SimpleDenseSearcher.from_prebuilt_index(
'msmarco-passage-tct_colbert-hnsw',
encoder
)
hits = searcher.search('what is a lobster roll')
for i in range(0, 10):
    print(f'{i+1:2} {hits[i].docid:7} {hits[i].score:.5f}')
If you encounter an error (on macOS), you'll need the following:
import os
os.environ['KMP_DUPLICATE_LIB_OK']='True'
The results should be as follows:
1 7157710 70.53742
2 7157715 70.50040
3 7157707 70.13804
4 6034350 69.93666
5 6321969 69.62683
6 4112862 69.34587
7 5515474 69.21354
8 7157708 69.08416
9 6321974 69.06841
10 2920399 69.01737
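Conceptually, dense retrieval ranks documents by the similarity between an encoded query vector and encoded document vectors. The toy sketch below does this with an exhaustive inner-product search over made-up three-dimensional vectors. It is only an illustration: the function names and vectors are hypothetical, the choice of inner product is an assumption, and the pre-built HNSW index used above performs approximate rather than exhaustive search.

```python
def inner_product(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def brute_force_search(query_vec, doc_vecs, k=10):
    """Score every document against the query and return the top k (docid, score) pairs."""
    scored = [(docid, inner_product(query_vec, vec)) for docid, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Made-up vectors standing in for the output of a document encoder.
toy_doc_vecs = {
    'doc1': [0.9, 0.1, 0.0],
    'doc2': [0.2, 0.8, 0.1],
    'doc3': [0.0, 0.3, 0.9],
}
# Made-up vector standing in for the output of a query encoder.
toy_query_vec = [1.0, 0.2, 0.0]

toy_hits = brute_force_search(toy_query_vec, toy_doc_vecs, k=2)
for rank, (docid, score) in enumerate(toy_hits, start=1):
    print(f'{rank:2} {docid:4} {score:.5f}')
```

In practice, the query encoder produces the query vector and Faiss handles the search over the document vectors.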
The HybridSearcher class provides the entry point to perform hybrid sparse-dense retrieval:
from pyserini.search import SimpleSearcher
from pyserini.dsearch import SimpleDenseSearcher, TctColBertQueryEncoder
from pyserini.hsearch import HybridSearcher
ssearcher = SimpleSearcher.from_prebuilt_index('msmarco-passage')
encoder = TctColBertQueryEncoder('castorini/tct_colbert-msmarco')
dsearcher = SimpleDenseSearcher.from_prebuilt_index(
'msmarco-passage-tct_colbert-hnsw',
encoder
)
hsearcher = HybridSearcher(dsearcher, ssearcher)
hits = hsearcher.search('what is a lobster roll')
for i in range(0, 10):
    print(f'{i+1:2} {hits[i].docid:7} {hits[i].score:.5f}')
The results should be as follows:
1 7157715 71.56022
2 7157710 71.52962
3 7157707 71.23887
4 6034350 70.98502
5 6321969 70.61903
6 4112862 70.33807
7 5515474 70.20574
8 6034357 70.11168
9 5837606 70.09911
10 7157708 70.07636
In general, hybrid retrieval will be more effective than dense retrieval, which will be more effective than sparse retrieval.
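The scores in the three result lists above are consistent with a simple linear interpolation of dense and sparse scores: for docid 7157715, 70.50040 + 0.1 × 10.59820 = 71.56022. The sketch below shows that kind of fusion in plain Python. It is an illustration, not Pyserini's actual implementation: the function name fuse_scores, the weight alpha=0.1, and the treatment of documents missing from one list are all assumptions.

```python
def fuse_scores(dense, sparse, alpha=0.1, k=10):
    """Combine two {docid: score} dicts as dense + alpha * sparse and return the top k."""
    # Assumption: a document missing from one list gets that list's minimum score.
    min_dense = min(dense.values()) if dense else 0.0
    min_sparse = min(sparse.values()) if sparse else 0.0
    fused = {}
    for docid in set(dense) | set(sparse):
        dense_score = dense.get(docid, min_dense)
        sparse_score = sparse.get(docid, min_sparse)
        fused[docid] = dense_score + alpha * sparse_score
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)[:k]


# Scores copied from the dense and sparse result lists above.
dense_scores = {'7157715': 70.50040, '7157707': 70.13804, '6034350': 69.93666}
sparse_scores = {'7157707': 11.00830, '7157715': 10.59820, '6034350': 10.48360}

ranked = fuse_scores(dense_scores, sparse_scores)
for rank, (docid, score) in enumerate(ranked, start=1):
    print(f'{rank:2} {docid:7} {score:.5f}')
```

With a small alpha, the dense score dominates and the sparse score mostly reorders documents whose dense scores are close.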
Another commonly used feature in Pyserini is to fetch a document (i.e., its text) given its docid.
This is easy to do:
from pyserini.search import SimpleSearcher
searcher = SimpleSearcher.from_prebuilt_index('msmarco-passage')
doc = searcher.doc('7157715')
From doc, you can access its contents as well as its raw representation.
The contents hold the representation of what's actually indexed; the raw representation is usually the original "raw document".
A simple example can illustrate this distinction: for an article from CORD-19, raw holds the complete JSON of the article, which obviously includes the article contents, but has metadata and other information as well.
The contents contain extracts from the article that are actually indexed (for example, the title and abstract).
In most cases, contents can be deterministically reconstructed from raw.
When building the index, we specify flags to store contents and/or raw; it is rarely the case that we store both, since that would be a waste of space.
In the case of the pre-built msmarco-passage index, we only store raw.
Thus:
# Document contents: what's actually indexed.
# Note, this is not stored in the pre-built msmarco-passage index.
doc.contents()
# Raw document
doc.raw()
As you'd expect, doc.id() returns the docid, which is 7157715 in this case.
Finally, doc.lucene_document() returns the underlying Lucene Document (i.e., a Java object).
With that, you get direct access to the complete Lucene API for manipulating documents.
Since each text in the MS MARCO passage corpus is a JSON object, we can read the document into Python and manipulate it:
import json
json_doc = json.loads(doc.raw())
json_doc['contents']
# 'contents' of the document:
# A Lobster Roll is a bread roll filled with bite-sized chunks of lobster meat...
Every document has a docid, of type string, assigned by the collection it is part of.
In addition, Lucene assigns each document a unique internal id (confusingly, Lucene also calls this the docid), which is an integer numbered sequentially starting from zero to one less than the number of documents in the index.
This can be a source of confusion but the meaning is usually clear from context.
Where there may be ambiguity, we refer to the external collection docid and Lucene's internal docid to be explicit.
Programmatically, the two are distinguished by type: the first is a string and the second is an integer.
As an important side note, Lucene's internal docids are not stable across different index instances.
That is, in two different index instances of the same collection, Lucene is likely to have assigned different internal docids for the same document.
This is because the internal docids are assigned based on document ingestion order; this will vary due to thread interleaving during indexing (which is usually performed on multiple threads).
The doc method in searcher takes either a string (interpreted as an external collection docid) or an integer (interpreted as Lucene's internal docid) and returns the corresponding document.
Thus, a simple way to iterate through all documents in the collection (and, for example, print out each external collection docid) is as follows:
for i in range(searcher.num_docs):
    print(searcher.doc(i).docid())
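To make the distinction between the two kinds of docid concrete, here is a toy in-memory stand-in for an index. ToyIndex is hypothetical and not part of Pyserini. It only mimics the behavior described above: lookup dispatches on the argument's type, and internal ids depend on ingestion order.

```python
class ToyIndex:
    """Hypothetical in-memory stand-in for an index; not part of Pyserini."""

    def __init__(self, external_docids):
        # The position in this list plays the role of Lucene's internal docid.
        self._by_internal = list(external_docids)
        self._by_external = {docid: i for i, docid in enumerate(self._by_internal)}

    @property
    def num_docs(self):
        return len(self._by_internal)

    def doc(self, docid):
        """Look up by external docid (str) or internal docid (int)."""
        if isinstance(docid, str):
            internal = self._by_external.get(docid)
        elif isinstance(docid, int):
            internal = docid if 0 <= docid < self.num_docs else None
        else:
            raise TypeError('docid must be a str or an int')
        if internal is None:
            return None
        return {'internal_docid': internal, 'docid': self._by_internal[internal]}


# Same collection, ingested in two different orders.
index_a = ToyIndex(['7157715', '7157707', '6034350'])
index_b = ToyIndex(['6034350', '7157715', '7157707'])

print(index_a.doc('7157715'))  # looked up by external collection docid
print(index_b.doc('7157715'))  # same document, different internal docid
print(index_a.doc(0))          # looked up by internal docid
```

The external docid is the same in both toy indexes, while the internal one differs, which is why only the external docid should be used to refer to a document across index instances.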
To build sparse indexes (i.e., Lucene inverted indexes) on your own document collections, follow the instructions below. To build dense indexes (e.g., the output of transformer encoders) on your own document collections, see instructions here. The following covers English documents; if you want to index and search multilingual documents, check out this answer.
Pyserini (via Anserini) provides ingestors for document collections in many different formats. The simplest, however, is the following JSON format:
{
  "id": "doc1",
  "contents": "this is the contents."
}
A document is simply comprised of two fields, a docid and contents.
Pyserini accepts collections comprised of these documents organized in three different ways:
- Folder with each JSON in its own file, like this.
- Folder with files, each of which contains an array of JSON documents, like this.
- Folder with files, each of which contains a JSON on an individual line, like this (often called JSONL format).
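As a rough sketch of the third layout (JSONL), the following script writes a list of (docid, text) pairs into a single file inside a folder, one JSON object per line. The function name write_jsonl, the file name docs.jsonl, and the sample records are hypothetical; only the "id" and "contents" field names come from the format described above.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(records, output_dir, filename='docs.jsonl'):
    """Write (docid, text) pairs as one JSON object per line; return the file path."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    path = output_dir / filename
    with path.open('w', encoding='utf-8') as f:
        for docid, text in records:
            # Each line is a complete JSON document with the two expected fields.
            f.write(json.dumps({'id': str(docid), 'contents': text}) + '\n')
    return path


# Made-up records standing in for your own documents.
my_records = [
    ('doc1', 'this is the contents.'),
    ('doc2', 'another toy document.'),
]

# A temporary folder is used here; in practice, pick a permanent location.
collection_dir = tempfile.mkdtemp()
jsonl_path = write_jsonl(my_records, collection_dir)
print(jsonl_path.read_text(encoding='utf-8'))
```

The resulting folder (not the file itself) is what you would pass as the input to the indexer.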
So, the quickest way to get started is to write a script that converts your documents into the above format. Then, you can invoke the indexer (here, we're indexing JSONL, but any of the other formats work as well):
python -m pyserini.index -collection JsonCollection \
-generator DefaultLuceneDocumentGenerator \
-threads 1 \
-input integrations/resources/sample_collection_jsonl \
-index indexes/sample_collection_jsonl \
-storePositions -storeDocvectors -storeRaw
Three options control the type of index that is built:
- -storePositions: builds a standard positional index
- -storeDocvectors: stores doc vectors (required for relevance feedback)
- -storeRaw: stores raw documents
If you don't specify any of the three options above, Pyserini builds an index that only stores term frequencies. This is sufficient for simple "bag of words" querying (and yields the smallest index size).
Once indexing is done, you can use SimpleSearcher to search the index:
from pyserini.search import SimpleSearcher
searcher = SimpleSearcher('indexes/sample_collection_jsonl')
hits = searcher.search('document')
for i in range(len(hits)):
    print(f'{i+1:2} {hits[i].docid:4} {hits[i].score:.5f}')
You should get something like the following:
1 doc2 0.25620
2 doc3 0.23140
If you want to perform a batch retrieval run (e.g., directly from the command line), organize all your queries in a tsv file, like here.
The format is simple: the first field is a query id, and the second field is the query itself.
Note that the file extension must end in .tsv so that Pyserini knows what format the queries are in.
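If your queries live in a Python data structure, a short script can produce such a file. This is a minimal sketch: the function name write_topics_tsv, the file name my_queries.tsv, and the queries themselves are made up for illustration, and collapsing embedded whitespace is a design choice of this sketch rather than a Pyserini requirement.

```python
import tempfile
from pathlib import Path


def write_topics_tsv(queries, path):
    """Write a {query_id: query_text} dict as two tab-separated columns."""
    with open(path, 'w', encoding='utf-8') as f:
        for qid, text in queries.items():
            # A tab or newline inside a query would break the two-column layout,
            # so collapse all runs of whitespace into single spaces.
            cleaned = ' '.join(str(text).split())
            f.write(f'{qid}\t{cleaned}\n')


# Made-up queries; the file name must end in .tsv.
my_queries = {
    '1': 'document',
    '2': 'contents of\tthe first document',
}
topics_path = Path(tempfile.mkdtemp()) / 'my_queries.tsv'
write_topics_tsv(my_queries, topics_path)
print(topics_path.read_text(encoding='utf-8'))
```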
Then, you can run:
$ python -m pyserini.search --topics integrations/resources/sample_queries.tsv \
--index indexes/sample_collection_jsonl \
--output run.sample.txt \
--bm25
$ cat run.sample.txt
1 Q0 doc2 1 0.256200 Anserini
1 Q0 doc3 2 0.231400 Anserini
2 Q0 doc1 1 0.534600 Anserini
3 Q0 doc1 1 0.256200 Anserini
3 Q0 doc2 2 0.256199 Anserini
4 Q0 doc3 1 0.483000 Anserini
Note that the output run file is in standard TREC format.
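Each line of a TREC run file has six whitespace-separated columns: query id, the literal Q0, docid, rank, score, and a run tag. The sketch below parses the sample output shown above into a dictionary keyed by query id. The function name parse_trec_run is hypothetical; this is a convenience illustration, not part of Pyserini.

```python
from collections import defaultdict


def parse_trec_run(lines):
    """Parse TREC run lines into {qid: [(rank, docid, score, tag), ...]} sorted by rank."""
    run = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        qid, _q0, docid, rank, score, tag = line.split()
        run[qid].append((int(rank), docid, float(score), tag))
    for results in run.values():
        results.sort()  # tuples sort by rank first
    return dict(run)


# The sample run from above, inlined so the sketch is self-contained.
sample_run = """\
1 Q0 doc2 1 0.256200 Anserini
1 Q0 doc3 2 0.231400 Anserini
2 Q0 doc1 1 0.534600 Anserini
3 Q0 doc1 1 0.256200 Anserini
3 Q0 doc2 2 0.256199 Anserini
4 Q0 doc3 1 0.483000 Anserini
""".splitlines()

parsed = parse_trec_run(sample_run)
for qid, results in parsed.items():
    top_rank, top_docid, top_score, _tag = results[0]
    print(f'query {qid}: top hit {top_docid} ({top_score:.4f})')
```

To parse a real run, pass an open file handle (for example, one opened on run.sample.txt) in place of sample_run.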
You can also add extra fields in your documents when needed, e.g., text features.
For example, the spaCy Named Entity Recognition (NER) result of contents could be stored as an additional field NER.
{
  "id": "doc1",
  "contents": "The Manhattan Project and its atomic bomb helped bring an end to World War II. Its legacy of peaceful uses of atomic energy continues to have an impact on history and science.",
  "NER": {
    "ORG": ["The Manhattan Project"],
    "MONEY": ["World War II"]
  }
}
With Pyserini, it's easy to reproduce runs on a number of standard IR test collections!
- Reproducing runs directly from the Python package
- Guide to reproducing the BM25 baseline for MS MARCO Passage Ranking
- Guide to reproducing the BM25 baseline for MS MARCO Document Ranking
- Guide to reproducing the multi-field BM25 baseline for MS MARCO Document Ranking from Elasticsearch
- Guide to reproducing Robust04 baselines for ad hoc retrieval
- Guide to reproducing TCT-ColBERT experiments
- Guide to reproducing DPR experiments
- Guide to reproducing ANCE experiments
- Guide to reproducing DistilBERT KD experiments
- Guide to reproducing DistilBERT Balanced Topic Aware Sampling experiments
- Guide to reproducing SBERT dense retrieval experiments
Pyserini provides baselines for a number of datasets.
- Baselines for KILT: a benchmark for Knowledge Intensive Language Tasks
- Baselines for TripClick: a large-scale dataset of click logs in the health domain
- Baselines (in Anserini) for the FEVER (Fact Extraction and VERification) dataset
- Guide to pre-built indexes
- Guide to interactive searching
- Guide to text classification with the 20Newsgroups dataset
- Guide to working with the COVID-19 Open Research Dataset (CORD-19)
- Guide to working with entity linking
- Guide to working with spaCy
- Usage of the Analyzer API
- Usage of the Index Reader API
- Usage of the Query Builder API
- Usage of the Collection API
- Direct Interaction via Pyjnius
Anserini is designed to work with JDK 11. There was a JRE path change above JDK 9 that breaks pyjnius 1.2.0, as documented in this issue, also reported in Anserini here and here. This issue was fixed with pyjnius 1.2.1 (released December 2019). The previous error was documented in this notebook and this notebook documents the fix.
- v0.12.0: May 5, 2021 [Release Notes]
- v0.11.0.0: February 18, 2021 [Release Notes]
- v0.10.1.0: January 8, 2021 [Release Notes]
- v0.10.0.1: December 2, 2020 [Release Notes]
- v0.10.0.0: November 26, 2020 [Release Notes]
- v0.9.4.0: June 26, 2020 [Release Notes]
- v0.9.3.1: June 11, 2020 [Release Notes]
- v0.9.3.0: May 27, 2020 [Release Notes]
- v0.9.2.0: May 15, 2020 [Release Notes]
- v0.9.1.0: May 6, 2020 [Release Notes]
- v0.9.0.0: April 18, 2020 [Release Notes]
- v0.8.1.0: March 22, 2020 [Release Notes]
- v0.8.0.0: March 12, 2020 [Release Notes]
- v0.7.2.0: January 25, 2020 [Release Notes]
- v0.7.1.0: January 9, 2020 [Release Notes]
- v0.7.0.0: December 13, 2019 [Release Notes]
- v0.6.0.0: November 2, 2019
With v0.11.0.0 and before, Pyserini versions adopted the convention of X.Y.Z.W, where X.Y.Z tracks the version of Anserini, and W is used to distinguish different releases on the Python end. Starting with Anserini v0.12.0, Anserini and Pyserini versions have become decoupled.