/retriv

⚡️Blazing-Fast Python Search Engine 🐍

⚡️ Introduction

retriv is a fast search engine implemented in Python, leveraging Numba for high-speed vector operations and automatic parallelization. It offers a user-friendly interface to index and search your document collection and allows you to automatically tune the underlying retrieval model, BM25.

How fast is your retriv?

✨ Features

Retrieval Models

retriv implements BM25 as a retrieval model. Alternatives will probably be added in the future.
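
To make the ranking function concrete, here is a minimal pure-Python sketch of one common BM25 formulation. It is not retriv's implementation (which relies on Numba). The function name bm25_scores, the plain whitespace tokenization, and the k1 = 1.2 and b = 0.75 values are assumptions made for illustration.

```python
import math
from collections import Counter

# Toy collection. Lowercasing + whitespace splitting is a simplifying assumption.
docs = {
    "doc_1": "Generals gathered in their masses",
    "doc_2": "Just like witches at black masses",
    "doc_3": "Evil minds that plot destruction",
    "doc_4": "Sorcerer of death's construction",
}


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score documents against a query with a common Okapi BM25 variant."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scores[doc_id] = score
    # Highest score first.
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))


ranking = bm25_scores("witches masses", docs)
print(ranking)
```

Under these assumptions, doc_2 (which matches both query terms) ranks above doc_1 (which matches only "masses"), and documents with no matching term are left out.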

Multi-search & Batch-search

In addition to the standard search functionality, retriv provides two additional search methods: msearch and bsearch.

  • msearch allows computing the results for multiple queries at once, leveraging the automatic parallelization features offered by Numba.
  • bsearch is similar to msearch but automatically generates batches of queries to evaluate and allows dynamic writing of the search results to disk in JSONl format. bsearch is very useful for pre-computing BM25 results for hundreds of thousands or even millions of queries without hogging your RAM. Pre-computed results can be leveraged for negative sampling during the training of Neural Models for Information Retrieval.
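
The batching idea behind bsearch can be sketched with the standard library alone. This is an illustration under assumptions, not retriv's code: iter_batches, batch_search_to_jsonl, and fake_search_many are hypothetical names, and the shape of each JSONl record is invented for the example.

```python
import json
import os
import tempfile
from itertools import islice


def iter_batches(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    iterator = iter(items)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def batch_search_to_jsonl(queries, search_many, path, batch_size=2):
    """Evaluate queries batch by batch and write one JSON line per query.

    Only one batch of results is held in memory at a time.
    """
    with open(path, "w", encoding="utf-8") as out:
        for batch in iter_batches(queries, batch_size):
            results = search_many(batch)  # expected shape: {query_id: {doc_id: score}}
            for query_id, doc_scores in results.items():
                record = {"id": query_id, "results": doc_scores}  # assumed record shape
                out.write(json.dumps(record) + "\n")


def fake_search_many(batch):
    # Hypothetical stand-in for a real multi-query search call.
    return {q["id"]: {"doc_1": float(len(q["text"]))} for q in batch}


queries = [{"id": f"q_{i}", "text": "witches masses"} for i in range(1, 6)]
out_path = os.path.join(tempfile.mkdtemp(), "results.jsonl")
batch_search_to_jsonl(queries, fake_search_many, out_path, batch_size=2)
```

Because results are flushed to disk after every batch, memory use depends on the batch size rather than on the total number of queries.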

AutoTune

retriv offers an automatic tuning functionality that allows you to tune BM25's parameters with a single function call. Under the hood, retriv leverages Optuna, a hyperparameter optimization framework, and ranx, an Information Retrieval evaluation library, to test several parameter configurations for BM25 and choose the best one.
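
The loop "try a configuration, evaluate it, keep the best" can be sketched as follows. Note the substitution: this example uses plain random search from the standard library instead of Optuna, and a toy objective instead of a ranx metric. The names random_search and toy_objective and the sampled ranges for k1 and b are assumptions made for illustration.

```python
import random


def random_search(objective, n_trials=100, seed=0):
    """Sample random (k1, b) pairs and keep the one with the highest objective."""
    rng = random.Random(seed)
    best_params, best_value = None, float("-inf")
    for _ in range(n_trials):
        # Assumed search ranges, chosen only for this sketch.
        params = {"k1": rng.uniform(0.0, 3.0), "b": rng.uniform(0.0, 1.0)}
        value = objective(params)
        if value > best_value:
            best_params, best_value = params, value
    return best_params, best_value


def toy_objective(params):
    # Hypothetical stand-in for "run the train queries and compute a metric".
    # It peaks at k1 = 1.5, b = 0.6 with a value of 1.0.
    return 1.0 - (params["k1"] - 1.5) ** 2 - (params["b"] - 0.6) ** 2


best_params, best_value = random_search(toy_objective, n_trials=100)
print(best_params, best_value)
```

In a real setting the objective would index nothing new: it would only re-score the train queries with the candidate parameters and return the chosen evaluation metric.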

Stemmers

Stemmers reduce words to their word stem, base or root form.
retriv supports the following stemmers:

  • snowball (default)
    The following languages are supported by Snowball Stemmer: Arabic, Basque, Catalan, Danish, Dutch, English, Finnish, French, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Lithuanian, Nepali, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, Tamil, Turkish.
    To select your preferred language, pass its name as the stemmer, e.g. stemmer="english" for Snowball English.
  • arlstem (Arabic)
  • arlstem2 (Arabic)
  • cistem (German)
  • isri (Arabic)
  • krovetz (English)
  • lancaster (English)
  • porter (English)
  • rslp (Portuguese)

Tokenizers

Tokenizers divide a string into smaller units, such as words.
retriv supports the following tokenizers:

Stop-word Lists

retriv supports stop-word lists for the following languages: Arabic, Azerbaijani, Basque, Bengali, Catalan, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hinglish, Hungarian, Indonesian, Italian, Kazakh, Nepali, Norwegian, Portuguese, Romanian, Russian, Slovene, Spanish, Swedish, Tajik, and Turkish.
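
How tokenization, stop-word removal, and stemming fit together can be sketched in a few lines. This is a toy pipeline under loud assumptions: the stop-word set is a tiny hand-picked sample, and naive_stem is a crude suffix stripper, not Snowball or any stemmer listed above.

```python
import string

STOPWORDS = {"in", "their", "at", "of", "that", "just"}  # tiny illustrative sample
SUFFIXES = ("ing", "ed", "es", "s")  # naive suffix stripping, NOT Snowball


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, drop punctuation, split on whitespace, filter, then stem."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [naive_stem(t) for t in tokens]


terms = preprocess("Generals gathered in their masses")
print(terms)  # ['general', 'gather', 'mass']
```

The same function has to be applied to both documents and queries, otherwise query terms would not match the indexed terms.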

Automatic Spell Correction

retriv provides automatic spell correction through Hunspell for 92 languages. Please, follow the link and choose your preferred language (e.g., Italian → "dictionary-it" → use "it"). For some languages you can directly pass their names: Danish, Dutch, English, Finnish, French, German, Greek, Hungarian, Italian, Portuguese, Romanian, Russian, Spanish, and Swedish.

NOTE: Automatic spell correction is disabled by default. It can introduce artifacts, degrading retrieval performance when documents are free from misspellings. If possible, check whether it improves retrieval performance for your specific document collection.
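
The effect of query-side spell correction can be illustrated with the standard library. Note the substitution: this sketch uses difflib's similarity matching against a small vocabulary instead of Hunspell, and correct_query, the vocabulary, and the 0.8 cutoff are assumptions made for illustration.

```python
import difflib


def correct_query(query, vocabulary, cutoff=0.8):
    """Replace each unknown token with its closest vocabulary entry, if any."""
    corrected = []
    for token in query.lower().split():
        if token in vocabulary:
            corrected.append(token)
            continue
        # Returns at most one candidate whose similarity ratio is >= cutoff.
        matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else token)
    return " ".join(corrected)


vocabulary = ["generals", "witches", "black", "masses", "evil", "minds"]
fixed = correct_query("wiches masses", vocabulary)
print(fixed)  # witches masses
```

The same mechanism explains the artifacts mentioned in the note: a correctly spelled word that is merely absent from the dictionary may be rewritten into a different word.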

🔌 Installation

pip install retriv

💡 Usage

Minimal Working Example

from retriv import SearchEngine

collection = [
  {"id": "doc_1", "text": "Generals gathered in their masses"},
  {"id": "doc_2", "text": "Just like witches at black masses"},
  {"id": "doc_3", "text": "Evil minds that plot destruction"},
  {"id": "doc_4", "text": "Sorcerer of death's construction"},
]

se = SearchEngine("new-index")
se.index(collection)

se.search("witches masses")

Output:

[
  {
    "id": "doc_2",
    "text": "Just like witches at black masses",
    "score": 1.7536403
  },
  {
    "id": "doc_1",
    "text": "Generals gathered in their masses",
    "score": 0.6931472
  }
]

Create index from file

You can index a document collection from a JSONl, CSV, or TSV file. CSV and TSV files must have a header. File kind is automatically inferred. Use the callback parameter to pass a function for converting your documents into the format supported by retriv on the fly. Indexes are automatically saved. This is the preferred way of creating indexes as it has a low memory footprint.

from retriv import SearchEngine

se = SearchEngine("new-index")

se.index_file(
  path="path/to/collection",  # File kind is automatically inferred
  show_progress=True,         # Default value
  callback=lambda doc: {      # Callback defaults to None
    "id": doc["id"],
    "text": doc["title"] + "\n" + doc["body"],
  },
)

se = SearchEngine("new-index") is equivalent to:

se = SearchEngine(
  index_name="new-index",               # Default value
  min_df=1,                             # Min doc-frequency. Defaults to 1.
  tokenizer="whitespace",               # Default value
  stemmer="english",                    # Default value (Snowball English)
  stopwords="english",                  # Default value
  spell_corrector=None,                 # Default value
  do_lowercasing=True,                  # Default value
  do_ampersand_normalization=True,      # Default value
  do_special_chars_normalization=True,  # Default value
  do_acronyms_normalization=True,       # Default value
  do_punctuation_removal=True,          # Default value
)
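
Why file-based indexing with a callback keeps memory usage low can be sketched with a generator. This is an illustration, not retriv's reader: read_jsonl is a hypothetical helper, and the sample file contents are made up for the example.

```python
import json
import os
import tempfile


def read_jsonl(path, callback=None):
    """Yield one document at a time so the whole file never sits in memory."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            doc = json.loads(line)
            # Convert the document on the fly, as the callback parameter does.
            yield callback(doc) if callback else doc


# Write a tiny sample file with made-up title/body fields.
path = os.path.join(tempfile.mkdtemp(), "collection.jsonl")
with open(path, "w", encoding="utf-8") as f:
    f.write(json.dumps({"id": "doc_1", "title": "Title one", "body": "Body one"}) + "\n")
    f.write(json.dumps({"id": "doc_2", "title": "Title two", "body": "Body two"}) + "\n")

converted = list(
    read_jsonl(
        path,
        callback=lambda doc: {
            "id": doc["id"],
            "text": doc["title"] + "\n" + doc["body"],
        },
    )
)
print(converted)
```

Each document is parsed, converted, and handed over before the next line is read, which is why this approach scales to collections that do not fit in RAM.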

Create index from list

collection = [
  {"id": "doc_1", "title": "...", "body": "..."},
  {"id": "doc_2", "title": "...", "body": "..."},
  {"id": "doc_3", "title": "...", "body": "..."},
  {"id": "doc_4", "title": "...", "body": "..."},
]

se = SearchEngine(...)

se.index(
  collection,
  show_progress=True,         # Default value
  callback=lambda doc: {      # Callback defaults to None
    "id": doc["id"],
    "text": doc["title"] + "\n" + doc["body"],
  },
)

Load / Delete index

from retriv import SearchEngine

se = SearchEngine.load("index-name")

SearchEngine.delete("index-name")

Search

se.search(
  query="witches masses",
  return_docs=True,  # Default value
  cutoff=100,        # Default value, number of results to return
)

Output:

[
  {
    "id": "doc_2",
    "text": "Just like witches at black masses",
    "score": 1.7536403
  },
  {
    "id": "doc_1",
    "text": "Generals gathered in their masses",
    "score": 0.6931472
  }
]

Multi-Search

se.msearch(
  queries=[{"id": "q_1", "text": "witches masses"}, ...],
  cutoff=100,  # Default value, number of results
)

Output:

{
  "q_1": {
    "doc_2": 1.7536403,
    "doc_1": 0.6931472
  },
  ...
}
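
Results in this shape are convenient for negative sampling, as mentioned in the features section. The following sketch picks the top-ranked documents that are not marked relevant. The helper name hard_negatives and the qrels layout ({query_id: {doc_id: relevance}}) are assumptions made for illustration.

```python
def hard_negatives(results, qrels, n=2):
    """For each query, return up to n top-ranked documents that are not relevant."""
    negatives = {}
    for query_id, doc_scores in results.items():
        relevant = qrels.get(query_id, {})
        # Sort document ids by descending score.
        ranked = sorted(doc_scores, key=doc_scores.get, reverse=True)
        # Keep documents with no positive relevance judgment.
        negatives[query_id] = [d for d in ranked if relevant.get(d, 0) <= 0][:n]
    return negatives


results = {"q_1": {"doc_2": 1.7536403, "doc_1": 0.6931472}}
qrels = {"q_1": {"doc_2": 1}}  # assumed layout: doc_2 is relevant for q_1

negatives = hard_negatives(results, qrels)
print(negatives)  # {'q_1': ['doc_1']}
```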

AutoTune

Use the AutoTune function to tune BM25 parameters w.r.t. your document collection and queries. All metrics supported by ranx are supported by the autotune function.

se.autotune(
    queries=[{ "q_id": "q_1", "text": "...", ... }],  # Train queries
    qrels=[{ "q_1": { "doc_1": 1, ... }, ... }],      # Train qrels
    metric="ndcg",  # Default value, metric to maximize
    n_trials=100,   # Default value, number of trials
    cutoff=100,     # Default value, number of results
)

At the end of the process, the best parameter configuration is automatically applied to the SearchEngine instance and saved to disk. You can see what the configuration is by printing se.hyperparams.

Speed Comparison

We performed a speed test, comparing retriv to rank_bm25, a popular BM25 implementation in Python, and pyserini, a Python binding to the Lucene search engine.

We relied on the MSMARCO Passage dataset to collect documents and queries. Specifically, we used the original document collection and three sub-samples of it, accounting for 1k, 100k, and 1M documents, respectively, and sampled 1k queries from the original ones. We computed the top-100 results with each library (if possible). Results are reported below.

Library                Collection Size  Elapsed Time  Avg. Query Time  Throughput (q/s)
rank_bm25              1,000            646ms         6.5ms            1548/s
pyserini               1,000            1,438ms       1.4ms            695/s
retriv                 1,000            140ms         0.1ms            7143/s
retriv (multi-search)  1,000            134ms         0.1ms            7463/s
rank_bm25              100,000          106,000ms     1060ms           1/s
pyserini               100,000          2,532ms       2.5ms            395/s
retriv                 100,000          314ms         0.3ms            3185/s
retriv (multi-search)  100,000          256ms         0.3ms            3906/s
rank_bm25              1,000,000        N/A           N/A              N/A
pyserini               1,000,000        4,060ms       4.1ms            246/s
retriv                 1,000,000        1,018ms       1.0ms            982/s
retriv (multi-search)  1,000,000        503ms         0.5ms            1988/s
rank_bm25              8,841,823        N/A           N/A              N/A
pyserini               8,841,823        12,245ms      12.2ms           82/s
retriv                 8,841,823        10,763ms      10.8ms           93/s
retriv (multi-search)  8,841,823        4,476ms       4.4ms            227/s
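
The derived columns follow from the elapsed time by simple arithmetic. The sketch below assumes that each elapsed time covers all 1k sampled queries, which matches the pyserini and retriv rows. The helper names are hypothetical.

```python
N_QUERIES = 1_000  # the text states that 1k queries were sampled


def avg_query_time_ms(elapsed_ms, n_queries=N_QUERIES):
    """Average time per query in milliseconds."""
    return elapsed_ms / n_queries


def throughput_qps(elapsed_ms, n_queries=N_QUERIES):
    """Queries per second, converting elapsed milliseconds to seconds."""
    return n_queries / (elapsed_ms / 1000)


# Rows for the 1,000-document collection.
retriv_qps = throughput_qps(140)
pyserini_qps = throughput_qps(1438)

# Full collection: retriv (multi-search) versus pyserini.
speedup = 12245 / 4476

print(round(retriv_qps), round(pyserini_qps), round(speedup, 1))
```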

🎁 Feature Requests

Would you like to see other features implemented? Please, open a feature request.

🤘 Want to contribute?

Would you like to contribute? Please, drop me an e-mail.

📄 License

retriv is open-source software licensed under the MIT license.