retriv is a fast search engine implemented in Python, leveraging Numba for high-speed vector operations and automatic parallelization. It offers a user-friendly interface to index and search your document collection and allows you to automatically tune the underlying retrieval model, BM25.
retriv implements BM25 as its retrieval model. Alternative models may be added in the future.
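As a reference for what BM25 computes, here is a minimal pure-Python sketch of the ranking function (illustrative only: retriv's implementation is Numba-compiled and vectorized, and its exact IDF variant may differ):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against the query with BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency of each query term.
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

A document matching both query terms scores higher than one matching only the more common term.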
In addition to the standard search functionality, retriv provides two additional search methods: msearch and bsearch.
- msearch allows computing the results for multiple queries at once, leveraging the automatic parallelization features offered by Numba.
- bsearch is similar to msearch but automatically generates batches of queries to evaluate and allows writing the search results to disk incrementally in JSONl format. bsearch is very useful for pre-computing BM25 results for hundreds of thousands or even millions of queries without hogging your RAM. Pre-computed results can be leveraged for negative sampling during the training of neural models for Information Retrieval.
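The batch-and-flush pattern behind bsearch can be sketched in plain Python with a stand-in scorer (the function name, signature, and output layout here are illustrative assumptions, not retriv's actual API):

```python
import json

def batched_search(queries, score_fn, path, batch_size=2):
    """Score queries in fixed-size batches and append each batch's results to a
    JSONl file, so only one batch of results is held in memory at a time."""
    with open(path, "w", encoding="utf-8") as f:
        for start in range(0, len(queries), batch_size):
            batch = queries[start:start + batch_size]
            for q in batch:
                results = score_fn(q["text"])  # e.g. {"doc_2": 1.75, ...}
                f.write(json.dumps({"id": q["id"], "results": results}) + "\n")

# Usage with a dummy scorer:
queries = [{"id": f"q_{i}", "text": "witches masses"} for i in range(5)]
batched_search(queries, lambda text: {"doc_2": 1.75}, "results.jsonl")
```

Because results are flushed batch by batch, memory usage stays flat regardless of how many queries are processed.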
retriv offers an automatic tuning functionality that allows you to tune BM25's parameters with a single function call. Under the hood, retriv leverages Optuna, a hyperparameter optimization framework, and ranx, an Information Retrieval evaluation library, to test several parameter configurations for BM25 and choose the best one.
Stemmers reduce words to their word stem, base or root form.
retriv supports the following stemmers:
- snowball (default)
The following languages are supported by Snowball Stemmer: Arabic, Basque, Catalan, Danish, Dutch, English, Finnish, French, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Lithuanian, Nepali, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Tamil, Turkish.
To select your preferred language simply use <language>.
- arlstem (Arabic)
- arlstem2 (Arabic)
- cistem (German)
- isri (Arabic)
- krovetz (English)
- lancaster (English)
- porter (English)
- rslp (Portuguese)
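As a rough illustration of what a stemmer does (real stemmers such as Snowball or Porter apply carefully ordered rule sets with vowel and measure conditions, not this naive suffix stripping):

```python
def toy_stem(word):
    """Strip a few common English suffixes. Illustrative only: not a real
    stemming algorithm, just a demonstration of the word -> stem idea."""
    for suffix in ("ation", "ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word
```

For example, "witches" and "masses" reduce to "witch" and "mass", so queries and documents match across inflected forms.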
Tokenizers divide a string into smaller units, such as words.
retriv supports the following tokenizers:
retriv supports stop-word lists for the following languages: Arabic, Azerbaijani, Basque, Bengali, Catalan, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hinglish, Hungarian, Indonesian, Italian, Kazakh, Nepali, Norwegian, Portuguese, Romanian, Russian, Slovene, Spanish, Swedish, Tajik, and Turkish.
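A minimal sketch of tokenization plus stop-word removal (the tiny stop-word list and helper names here are illustrative assumptions, not retriv's internals):

```python
import re

STOPWORDS = {"at", "in", "their", "of", "the", "a"}  # tiny illustrative list

def whitespace_tokenize(text):
    """Split on runs of whitespace only."""
    return text.split()

def word_tokenize(text):
    """Keep lowercased alphanumeric runs, dropping punctuation."""
    return re.findall(r"\w+", text.lower())

def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]

tokens = remove_stopwords(word_tokenize("Generals gathered in their masses"))
```

Removing high-frequency function words shrinks the index and focuses scoring on content-bearing terms.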
retriv provides automatic spell correction through Hunspell for 92 languages. Please, follow the link and choose your preferred language (e.g., Italian → "dictionary-it" → use "it"). For some languages you can directly pass their names: Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Italian, Portuguese, Romanian, Russian, Spanish, and Swedish.
NOTE: Automatic spell correction is disabled by default. It can introduce artifacts, degrading retrieval performance when documents are free from misspellings. If possible, check whether it improves retrieval performance for your specific document collection.
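To illustrate the general idea (and why correction can hurt on clean text), here is a toy nearest-neighbor corrector based on Levenshtein distance; retriv actually relies on Hunspell dictionaries, not this sketch:

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def correct(word, vocabulary, max_distance=2):
    """Replace `word` with its closest vocabulary entry, if close enough."""
    best = min(vocabulary, key=lambda v: edit_distance(word, v))
    return best if edit_distance(word, best) <= max_distance else word
```

Note how a correctly spelled but out-of-vocabulary word could still be rewritten to a nearby dictionary entry, which is exactly the kind of artifact the note above warns about.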
pip install retriv
from retriv import SearchEngine
collection = [
    {"id": "doc_1", "text": "Generals gathered in their masses"},
    {"id": "doc_2", "text": "Just like witches at black masses"},
    {"id": "doc_3", "text": "Evil minds that plot destruction"},
    {"id": "doc_4", "text": "Sorcerer of death's construction"},
]
se = SearchEngine("new-index")
se.index(collection)
se.search("witches masses")
Output:
[
    {
        "id": "doc_2",
        "text": "Just like witches at black masses",
        "score": 1.7536403
    },
    {
        "id": "doc_1",
        "text": "Generals gathered in their masses",
        "score": 0.6931472
    }
]
You can index a document collection from a JSONl, CSV, or TSV file. CSV and TSV files must have a header. File kind is automatically inferred. Use the callback parameter to pass a function that converts your documents into the format supported by retriv on the fly. Indexes are automatically saved. This is the preferred way of creating indexes as it has a low memory footprint.
from retriv import SearchEngine
se = SearchEngine("new-index")
se.index_file(
    path="path/to/collection",  # File kind is automatically inferred
    show_progress=True,  # Default value
    callback=lambda doc: {  # Callback defaults to None
        "id": doc["id"],
        "text": doc["title"] + "\n" + doc["body"],
    },
)
se = SearchEngine("new-index")
is equivalent to:
se = SearchEngine(
    index_name="new-index",  # Default value
    min_df=1,  # Min doc-frequency. Defaults to 1.
    tokenizer="whitespace",  # Default value
    stemmer="english",  # Default value (Snowball English)
    stopwords="english",  # Default value
    spell_corrector=None,  # Default value
    do_lowercasing=True,  # Default value
    do_ampersand_normalization=True,  # Default value
    do_special_chars_normalization=True,  # Default value
    do_acronyms_normalization=True,  # Default value
    do_punctuation_removal=True,  # Default value
)
collection = [
    {"id": "doc_1", "title": "...", "body": "..."},
    {"id": "doc_2", "title": "...", "body": "..."},
    {"id": "doc_3", "title": "...", "body": "..."},
    {"id": "doc_4", "title": "...", "body": "..."},
]
se = SearchEngine(...)
se.index(
    collection,
    show_progress=True,  # Default value
    callback=lambda doc: {  # Callback defaults to None
        "id": doc["id"],
        "text": doc["title"] + "\n" + doc["body"],
    },
)
from retriv import SearchEngine
se = SearchEngine.load("index-name")
SearchEngine.delete("index-name")
se.search(
    query="witches masses",
    return_docs=True,  # Default value
    cutoff=100,  # Default value, number of results to return
)
Output:
[
    {
        "id": "doc_2",
        "text": "Just like witches at black masses",
        "score": 1.7536403
    },
    {
        "id": "doc_1",
        "text": "Generals gathered in their masses",
        "score": 0.6931472
    }
]
se.msearch(
    queries=[{"id": "q_1", "text": "witches masses"}, ...],
    cutoff=100,  # Default value, number of results
)
Output:
{
    "q_1": {
        "doc_2": 1.7536403,
        "doc_1": 0.6931472
    },
    ...
}
Use the AutoTune function to tune BM25 parameters w.r.t. your document collection and queries. All metrics supported by ranx are supported by the autotune function.
se.autotune(
    queries=[{ "q_id": "q_1", "text": "...", ... }],  # Train queries
    qrels=[{ "q_1": { "doc_1": 1, ... }, ... }],  # Train qrels
    metric="ndcg",  # Default value, metric to maximize
    n_trials=100,  # Default value, number of trials
    cutoff=100,  # Default value, number of results
)
At the end of the process, the best parameter configuration is automatically applied to the SearchEngine instance and saved to disk. You can inspect it by printing se.hyperparams.
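Conceptually, the tuning loop searches over (k1, b) configurations and keeps the one maximizing the chosen metric. A minimal grid-search sketch (retriv instead samples configurations with Optuna and evaluates them with ranx; the grids and evaluate callback below are illustrative assumptions):

```python
import itertools

def tune_bm25(evaluate, k1_grid=(0.9, 1.2, 1.5), b_grid=(0.4, 0.75, 1.0)):
    """Exhaustively evaluate (k1, b) pairs and keep the best-scoring one.
    `evaluate` should return the metric to maximize, e.g. NDCG on held-out qrels."""
    best_params, best_score = None, float("-inf")
    for k1, b in itertools.product(k1_grid, b_grid):
        score = evaluate(k1, b)
        if score > best_score:
            best_params, best_score = {"k1": k1, "b": b}, score
    return best_params, best_score
```

A Bayesian optimizer like Optuna explores the same space more efficiently than an exhaustive grid, which matters when each trial requires re-scoring the whole query set.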
We performed a speed test, comparing retriv to rank_bm25, a popular BM25 implementation in Python, and pyserini, a Python binding to the Lucene search engine.
We relied on the MSMARCO Passage dataset to collect documents and queries. Specifically, we used the original document collection and three sub-samples of it, accounting for 1k, 100k, and 1M documents, respectively, and sampled 1k queries from the original ones. We computed the top-100 results with each library (if possible). Results are reported below. Best results are highlighted in boldface.
| Library | Collection Size | Elapsed Time | Avg. Query Time | Throughput (q/s) |
| --- | --- | --- | --- | --- |
| rank_bm25 | 1,000 | 646ms | 0.6ms | 1548/s |
| pyserini | 1,000 | 1,438ms | 1.4ms | 695/s |
| retriv | 1,000 | 140ms | **0.1ms** | 7143/s |
| retriv (multi-search) | 1,000 | **134ms** | **0.1ms** | **7463/s** |
| rank_bm25 | 100,000 | 106,000ms | 106ms | 9/s |
| pyserini | 100,000 | 2,532ms | 2.5ms | 395/s |
| retriv | 100,000 | 314ms | **0.3ms** | 3185/s |
| retriv (multi-search) | 100,000 | **256ms** | **0.3ms** | **3906/s** |
| rank_bm25 | 1,000,000 | N/A | N/A | N/A |
| pyserini | 1,000,000 | 4,060ms | 4.1ms | 246/s |
| retriv | 1,000,000 | 1,018ms | 1.0ms | 982/s |
| retriv (multi-search) | 1,000,000 | **503ms** | **0.5ms** | **1988/s** |
| rank_bm25 | 8,841,823 | N/A | N/A | N/A |
| pyserini | 8,841,823 | 12,245ms | 12.2ms | 82/s |
| retriv | 8,841,823 | 10,763ms | 10.8ms | 93/s |
| retriv (multi-search) | 8,841,823 | **4,476ms** | **4.4ms** | **227/s** |
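The derived columns follow directly from the elapsed time over the 1k sampled queries: average query time is the elapsed time divided by the number of queries, and throughput is its inverse. For example, for retriv on the 1M-document sample:

```python
n_queries = 1_000   # queries sampled from MSMARCO
elapsed_ms = 1_018  # retriv, 1,000,000 documents

avg_query_ms = elapsed_ms / n_queries             # ~1.0 ms per query
throughput_qps = n_queries / (elapsed_ms / 1000)  # ~982 queries per second
```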
Would you like to see other features implemented? Please, open a feature request.
Would you like to contribute? Please, drop me an e-mail.
retriv is open-source software licensed under the MIT license.