/beir

A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.

🍻 What is it?

BEIR is a heterogeneous benchmark containing diverse IR tasks. It also provides a common and easy framework for evaluation of your NLP-based retrieval models within the benchmark.

For more information, check out our publications:

🍻 Installation

Install via pip:

pip install beir

If you want to build from source, use:

$ git clone https://github.com/benchmarkir/beir.git
$ cd beir
$ pip install -e .

Tested with Python versions 3.6 and 3.7.

🍻 Features

  • Preprocess your own IR dataset or use one of the already-preprocessed 17 benchmark datasets
  • Covers a wide range of settings, with diverse benchmarks useful for both academia and industry
  • Includes well-known retrieval architectures (lexical, dense, sparse and reranking-based)
  • Add and evaluate your own model in an easy framework using different state-of-the-art evaluation metrics

🍻 Leaderboard

The BEIR leaderboards are hosted on Google Sheets, linked below, because the tables were hard to read in Markdown.

Leaderboard Link
Dense Retrieval Google Sheet
BM25 top-100 + CE Reranking Google Sheet

🍻 Course Material on IR

If you are new to Information Retrieval and want to learn more about classical or neural IR, we suggest looking at the open-source courses below.

Course University Instructor Link Available
Training SOTA Neural Search Models Hugging Face Nils Reimers Link Video
BEIR: Benchmarking IR UKP Lab Nandan Thakur Link Video + Slides
Intro to Advanced IR TU Wien'21 Sebastian Hofstaetter Link Videos + Slides
CS224U NLU + IR Stanford'21 Omar Khattab Link Slides
Pretrained Transformers for Text Ranking: BERT and Beyond MPI, Waterloo'21 Andrew Yates, Rodrigo Nogueira, Jimmy Lin Link PDF
BoF Session on IR NAACL'21 Sean MacAvaney, Luca Soldaini Link Slides

🍻 Examples and Tutorials

To easily understand and get your hands dirty with BEIR, we invite you to try our tutorials out 🚀 🚀

🍻 Google Colab

Name Link
How to evaluate pre-trained models on BEIR datasets Open In Colab

🍻 Lexical Retrieval (Evaluation)

Name Link
BM25 Retrieval with Elasticsearch evaluate_bm25.py
Anserini-BM25 (Pyserini) Retrieval with Docker evaluate_anserini_bm25.py
Multilingual BM25 Retrieval with Elasticsearch 🆕 evaluate_multilingual_bm25.py

🍻 Dense Retrieval (Evaluation)

Name Link
Exact-search retrieval using (dense) Sentence-BERT evaluate_sbert.py
Exact-search retrieval using (dense) ANCE evaluate_ance.py
Exact-search retrieval using (dense) DPR evaluate_dpr.py
Exact-search retrieval using (dense) USE-QA evaluate_useqa.py
ANN and Exact-search using Faiss 🆕 evaluate_faiss_dense.py
Retrieval using Binary Passage Retriever (BPR) 🆕 evaluate_bpr.py
Dimension Reduction using PCA 🆕 evaluate_dim_reduction.py

🍻 Sparse Retrieval (Evaluation)

Name Link
Hybrid sparse retrieval using SPARTA evaluate_sparta.py
Sparse retrieval using docT5query and Pyserini evaluate_anserini_docT5query.py
Sparse retrieval using docT5query (MultiGPU) and Pyserini 🆕 evaluate_anserini_docT5query_parallel.py
Sparse retrieval using DeepCT and Pyserini 🆕 evaluate_deepct.py

🍻 Reranking (Evaluation)

Name Link
Reranking top-100 BM25 results with SBERT CE evaluate_bm25_ce_reranking.py
Reranking top-100 BM25 results with Dense Retriever evaluate_bm25_sbert_reranking.py

🍻 Dense Retrieval (Training)

Name Link
Train SBERT with Inbatch negatives train_sbert.py
Train SBERT with BM25 hard negatives train_sbert_BM25_hardnegs.py
Train MSMARCO SBERT with BM25 Negatives train_msmarco_v2.py
Train (SOTA) MSMARCO SBERT with Mined Hard Negatives 🆕 train_msmarco_v3.py
Train (SOTA) MSMARCO BPR with Mined Hard Negatives 🆕 train_msmarco_v3_bpr.py
Train (SOTA) MSMARCO SBERT with Mined Hard Negatives (Margin-MSE) 🆕 train_msmarco_v3_margin_MSE.py

🍻 Question Generation

Name Link
Synthetic Query Generation using T5-model query_gen.py
(GenQ) Synthetic QG using T5-model + fine-tuning SBERT query_gen_and_train.py
Synthetic Query Generation using Multiple GPU and T5 🆕 query_gen_multi_gpu.py

🍻 Benchmarking (Evaluation)

Name Link
Benchmark BM25 (Inference speed) benchmark_bm25.py
Benchmark Cross-Encoder Reranking (Inference speed) benchmark_bm25_ce_reranking.py
Benchmark Dense Retriever (Inference speed) benchmark_sbert.py

🍻 Quick Example

from beir import util, LoggingHandler
from beir.retrieval import models
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

import logging
import pathlib, os

#### Just some code to print debug information to stdout
logging.basicConfig(format='%(asctime)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    level=logging.INFO,
                    handlers=[LoggingHandler()])
#### /print debug information to stdout

#### Download scifact.zip dataset and unzip the dataset
dataset = "scifact"
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
out_dir = os.path.join(pathlib.Path(__file__).parent.absolute(), "datasets")
data_path = util.download_and_unzip(url, out_dir)

#### Provide the data_path where scifact has been downloaded and unzipped
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

#### Load the SBERT model and retrieve using cosine-similarity
model = DRES(models.SentenceBERT("msmarco-distilbert-base-v3"), batch_size=16)
retriever = EvaluateRetrieval(model, score_function="cos_sim") # or "dot" for dot-product
results = retriever.retrieve(corpus, queries)

#### Evaluate your model with NDCG@k, MAP@K, Recall@K and Precision@K  where k = [1,3,5,10,100,1000] 
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
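
To inspect what was retrieved, it helps to look at the top hits per query. The sketch below assumes that `results` is a nested dictionary mapping each query ID to `{doc_id: score}`, the same shape as the `qrels` dictionaries shown in this README. The helper `top_k_hits` and the toy data are hypothetical and not part of BEIR.

```python
from typing import Dict, List, Tuple


def top_k_hits(results: Dict[str, Dict[str, float]], k: int = 3) -> Dict[str, List[Tuple[str, float]]]:
    """Return the k highest-scoring (doc_id, score) pairs for every query."""
    return {
        qid: sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)[:k]
        for qid, doc_scores in results.items()
    }


# Toy stand-in for the output of retriever.retrieve(corpus, queries)
toy_results = {
    "q1": {"doc1": 0.91, "doc2": 0.12, "doc3": 0.40},
    "q2": {"doc2": 0.77, "doc1": 0.05},
}

hits = top_k_hits(toy_results, k=2)
for qid, ranked in hits.items():
    print(qid, ranked)
```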

🍻 Download a preprocessed dataset

To download and load one of the already-preprocessed datasets into your current directory, use:

from beir import util
from beir.datasets.data_loader import GenericDataLoader

dataset = "scifact"
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

This will download the scifact dataset under the datasets directory.

For other datasets, use one of the dataset names (BEIR-Name) listed below.

🍻 Available Datasets

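To compute the MD5 hash of a downloaded archive from a terminal, run md5sum filename.zip (or md5 filename.zip on macOS) and compare the output with the md5 column below.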
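
The same check can be done from Python with the standard library. This is a minimal sketch: the helper name `md5_of_file` is hypothetical, and the path in the usage note assumes the archive was saved as `datasets/scifact.zip`.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, `md5_of_file("datasets/scifact.zip")` can be compared against the md5 value listed for scifact in the table.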

Dataset Website BEIR-Name Type Queries Corpus Rel D/Q Download md5
MSMARCO Homepage msmarco train/dev/test 6,980 8.84M 1.1 Link 444067daf65d982533ea17ebd59501e4
MSMARCO v2 Homepage msmarco-v2 train/dev1/dev2 4,552/4,702 138M - Link ba6238b403f0b345683885cc9390fff5
TREC-COVID Homepage trec-covid test 50 171K 493.5 Link ce62140cb23feb9becf6270d0d1fe6d1
NFCorpus Homepage nfcorpus train/dev/test 323 3.6K 38.2 Link a89dba18a62ef92f7d323ec890a0d38d
BioASQ Homepage bioasq train/test 500 14.91M 8.05 No How to Reproduce?
NQ Homepage nq train/test 3,452 2.68M 1.2 Link d4d3d2e48787a744b6f6e691ff534307
HotpotQA Homepage hotpotqa train/dev/test 7,405 5.23M 2.0 Link f412724f78b0d91183a0e86805e16114
FiQA-2018 Homepage fiqa train/dev/test 648 57K 2.6 Link 17918ed23cd04fb15047f73e6c3bd9d9
Signal-1M(RT) Homepage signal1m test 97 2.86M 19.6 No How to Reproduce?
TREC-NEWS Homepage trec-news test 57 595K 19.6 No How to Reproduce?
ArguAna Homepage arguana test 1,406 8.67K 1.0 Link 8ad3e3c2a5867cdced806d6503f29b99
Touche-2020 Homepage webis-touche2020 test 49 382K 19.0 Link 46f650ba5a527fc69e0a6521c5a23563
CQADupstack Homepage cqadupstack test 13,145 457K 1.4 Link 4e41456d7df8ee7760a7f866133bda78
Quora Homepage quora dev/test 10,000 523K 1.6 Link 18fb154900ba42a600f84b839c173167
DBPedia Homepage dbpedia-entity dev/test 400 4.63M 38.2 Link c2a39eb420a3164af735795df012ac2c
SCIDOCS Homepage scidocs test 1,000 25K 4.9 Link 38121350fc3a4d2f48850f6aff52e4a9
FEVER Homepage fever train/dev/test 6,666 5.42M 1.2 Link 5a818580227bfb4b35bb6fa46d9b6c03
Climate-FEVER Homepage climate-fever test 1,535 5.42M 3.0 Link 8b66f0a9126c521bae2bde127b4dc99d
SciFact Homepage scifact train/test 300 5K 1.1 Link 5f7d1de60b170fc8027bb7898e2efca1
Robust04 Homepage robust04 test 249 528K 69.9 No How to Reproduce?

🍻 Multilingual Datasets

Language Dataset Website BEIR-Name Type Queries Corpus Rel D/Q Download md5
German GermanQuAD Homepage germanquad test 2,044 2.80M 1.0 Link 95a581c3162d10915a418609bcce851b
Arabic Mr.TyDI Homepage mrtydi/arabic train/dev/test 1,081 2.1M 1.2 Link 17072d0e1610bd8461d962b8ac560fc5
Bengali Mr.TyDI Homepage mrtydi/bengali train/dev/test 111 304K 1.2 Link 17072d0e1610bd8461d962b8ac560fc5
Finnish Mr.TyDI Homepage mrtydi/finnish train/dev/test 1,254 1.9M 1.2 Link 17072d0e1610bd8461d962b8ac560fc5
Indonesian Mr.TyDI Homepage mrtydi/indonesian train/dev/test 829 1.47M 1.2 Link 17072d0e1610bd8461d962b8ac560fc5
Japanese Mr.TyDI Homepage mrtydi/japanese train/dev/test 720 7M 1.3 Link 17072d0e1610bd8461d962b8ac560fc5
Korean Mr.TyDI Homepage mrtydi/korean train/dev/test 421 1.5M 1.2 Link 17072d0e1610bd8461d962b8ac560fc5
Russian Mr.TyDI Homepage mrtydi/russian train/dev/test 995 9.6M 1.2 Link 17072d0e1610bd8461d962b8ac560fc5
Swahili Mr.TyDI Homepage mrtydi/swahili train/dev/test 670 136K 1.1 Link 17072d0e1610bd8461d962b8ac560fc5
Telugu Mr.TyDI Homepage mrtydi/telugu train/dev/test 646 548K 1.0 Link 17072d0e1610bd8461d962b8ac560fc5
Thai Mr.TyDI Homepage mrtydi/thai train/dev/test 1,190 568K 1.1 Link 17072d0e1610bd8461d962b8ac560fc5

🍻 Translated (Multilingual) Datasets

Language Dataset Website BEIR-Name Type Queries Corpus Rel D/Q Download md5
Spanish mMARCO Homepage mmarco/spanish train/dev 6,980 8.84M 1.1 Link b727dbec65315a76bceaff56ad77d2c7
French mMARCO Homepage mmarco/french train/dev 6,980 8.84M 1.1 Link b727dbec65315a76bceaff56ad77d2c7
Portuguese mMARCO Homepage mmarco/portuguese train/dev 6,980 8.84M 1.1 Link b727dbec65315a76bceaff56ad77d2c7
Italian mMARCO Homepage mmarco/italian train/dev 6,980 8.84M 1.1 Link b727dbec65315a76bceaff56ad77d2c7
Indonesian mMARCO Homepage mmarco/indonesian train/dev 6,980 8.84M 1.1 Link b727dbec65315a76bceaff56ad77d2c7
German mMARCO Homepage mmarco/german train/dev 6,980 8.84M 1.1 Link b727dbec65315a76bceaff56ad77d2c7
Russian mMARCO Homepage mmarco/russian train/dev 6,980 8.84M 1.1 Link b727dbec65315a76bceaff56ad77d2c7
Chinese mMARCO Homepage mmarco/chinese train/dev 6,980 8.84M 1.1 Link b727dbec65315a76bceaff56ad77d2c7

Otherwise, you can load a custom preprocessed dataset in the following way:

from beir.datasets.data_loader import GenericDataLoader

corpus_path = "your_corpus_file.jsonl"
query_path = "your_query_file.jsonl"
qrels_path = "your_qrels_file.tsv"

corpus, queries, qrels = GenericDataLoader(
    corpus_file=corpus_path, 
    query_file=query_path, 
    qrels_file=qrels_path).load_custom()

Make sure that the dataset is in the following format:

  • corpus file: a .jsonl file (JSON Lines) with one dictionary per line, each with three fields: _id (unique document identifier), title (document title, optional) and text (document paragraph or passage). For example: {"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born...."}
  • queries file: a .jsonl file (JSON Lines) with one dictionary per line, each with two fields: _id (unique query identifier) and text (query text). For example: {"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}
  • qrels file: a .tsv file (tab-separated) with three columns, in this order: query-id, corpus-id and score. Keep the first row as a header. For example: q1 doc1 1
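
The three files described above can be produced with the standard library alone. The following is a minimal sketch, not an official BEIR utility: the helper names `write_jsonl` and `write_qrels`, the toy records, and the temporary output directory are all assumptions made for illustration.

```python
import csv
import json
import os
import tempfile

# Toy records that follow the field names described above.
corpus_rows = [
    {"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born physicist."},
    {"_id": "doc2", "title": "", "text": "Wheat beer is a top-fermented beer."},
]
query_rows = [
    {"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"},
    {"_id": "q2", "text": "Which beer is brewed with a large proportion of wheat?"},
]
qrel_rows = [("q1", "doc1", 1), ("q2", "doc2", 1)]


def write_jsonl(path, rows):
    """Write one JSON object per line (JSON Lines)."""
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")


def write_qrels(path, rows):
    """Write a tab-separated qrels file whose first row is the header."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["query-id", "corpus-id", "score"])
        writer.writerows(rows)


out_dir = tempfile.mkdtemp()  # hypothetical output location
corpus_path = os.path.join(out_dir, "your_corpus_file.jsonl")
query_path = os.path.join(out_dir, "your_query_file.jsonl")
qrels_path = os.path.join(out_dir, "your_qrels_file.tsv")

write_jsonl(corpus_path, corpus_rows)
write_jsonl(query_path, query_rows)
write_qrels(qrels_path, qrel_rows)
```

The resulting paths can then be passed to `GenericDataLoader(corpus_file=..., query_file=..., qrels_file=...).load_custom()` as shown earlier.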

You can also skip the dataset loading step and provide corpus, queries and qrels directly as dictionaries:

corpus = {
    "doc1" : {
        "title": "Albert Einstein", 
        "text": "Albert Einstein was a German-born theoretical physicist who developed the theory of relativity, \
                 one of the two pillars of modern physics (alongside quantum mechanics). His work is also known for \
                 its influence on the philosophy of science. He is best known to the general public for his mass–energy \
                 equivalence formula E = mc2, which has been dubbed 'the world's most famous equation'. He received the 1921 \
                 Nobel Prize in Physics 'for his services to theoretical physics, and especially for his discovery of the law \
                 of the photoelectric effect', a pivotal step in the development of quantum theory."
        },
    "doc2" : {
        "title": "", # Keep title an empty string if not present
        "text": "Wheat beer is a top-fermented beer which is brewed with a large proportion of wheat relative to the amount of \
                 malted barley. The two main varieties are German Weißbier and Belgian witbier; other types include Lambic (made\
                 with wild yeast), Berliner Weisse (a cloudy, sour beer), and Gose (a sour, salty beer)."
    },
}

queries = {
    "q1" : "Who developed the mass-energy equivalence formula?",
    "q2" : "Which beer is brewed with a large proportion of wheat?"
}

qrels = {
    "q1" : {"doc1": 1},
    "q2" : {"doc2": 1},
}
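
When building these dictionaries by hand, a quick consistency check can catch qrels that point at missing queries or documents. This is a minimal sketch with a hypothetical helper, `check_dataset`, which is not part of BEIR; it only assumes the dictionary shapes shown above.

```python
from typing import Dict, List


def check_dataset(corpus: Dict[str, Dict[str, str]],
                  queries: Dict[str, str],
                  qrels: Dict[str, Dict[str, int]]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = []
    for doc_id, doc in corpus.items():
        if "text" not in doc:
            problems.append(f"corpus[{doc_id!r}] has no 'text' field")
    for query_id, judged in qrels.items():
        if query_id not in queries:
            problems.append(f"qrels refers to unknown query {query_id!r}")
        for doc_id in judged:
            if doc_id not in corpus:
                problems.append(f"qrels[{query_id!r}] refers to unknown document {doc_id!r}")
    return problems


toy_corpus = {"doc1": {"title": "Albert Einstein", "text": "..."},
              "doc2": {"title": "", "text": "..."}}
toy_queries = {"q1": "Who developed the mass-energy equivalence formula?",
               "q2": "Which beer is brewed with a large proportion of wheat?"}
toy_qrels = {"q1": {"doc1": 1}, "q2": {"doc2": 1}}

print(check_dataset(toy_corpus, toy_queries, toy_qrels))
```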

Disclaimer

Similar to Tensorflow datasets or HuggingFace's datasets library, we just downloaded and prepared public datasets. We only distribute these datasets in a specific format, but we do not vouch for their quality or fairness, or claim that you have license to use the dataset. It remains the user's responsibility to determine whether you as a user have permission to use the dataset under the dataset's license and to cite the right owner of the dataset.

If you're a dataset owner and wish to update any part of it, or do not want your dataset to be included in this library, feel free to post an issue here or make a pull request!

If you're a dataset owner and wish to include your dataset or model in this library, feel free to post an issue here or make a pull request!

🍻 Evaluate a model

We include different retrieval architectures and evaluate them all in a zero-shot setup.

Lexical Retrieval Evaluation using BM25 (Elasticsearch)

from beir.retrieval.search.lexical import BM25Search as BM25

hostname = "your-hostname" #localhost
index_name = "your-index-name" # scifact
initialize = True # True, will delete existing index with same name and reindex all documents
model = BM25(index_name=index_name, hostname=hostname, initialize=initialize)
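
For intuition about what a BM25 ranker computes, here is a minimal pure-Python sketch of the BM25 scoring function. It is not how Elasticsearch or BEIR implement it: the function name `bm25_scores`, whitespace tokenization, the IDF variant, and the parameter values k1 = 1.2 and b = 0.75 are assumptions chosen for illustration.

```python
import math
from collections import Counter
from typing import Dict


def bm25_scores(query: str, docs: Dict[str, str], k1: float = 1.2, b: float = 0.75) -> Dict[str, float]:
    """Score every document against the query with a basic BM25 formula."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = (sum(len(tokens) for tokens in tokenized.values()) / n_docs) or 1.0

    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Length normalization: longer-than-average documents are penalized.
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores[doc_id] = score
    return scores


docs = {
    "doc1": "einstein developed the theory of relativity",
    "doc2": "wheat beer is brewed with a large proportion of wheat",
}
scores = bm25_scores("wheat beer", docs)
print(sorted(scores, key=scores.get, reverse=True))
```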

Sparse Retrieval using SPARTA

from beir.retrieval.search.sparse import SparseSearch
from beir.retrieval import models

model_path = "BeIR/sparta-msmarco-distilbert-base-v1"
sparse_model = SparseSearch(models.SPARTA(model_path), batch_size=128)

Dense Retrieval using SBERT, ANCE, USE-QA or DPR

from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

model = DRES(models.SentenceBERT("msmarco-distilbert-base-v3"), batch_size=16)
retriever = EvaluateRetrieval(model, score_function="cos_sim") # or "dot" for dot-product
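
The choice between "cos_sim" and "dot" matters because cosine similarity ignores vector length while the dot product rewards it. The sketch below illustrates the difference with toy 2-dimensional vectors in plain Python; the functions `dot` and `cos_sim` here are illustrative stand-ins, not BEIR's implementations.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cos_sim(a: List[float], b: List[float]) -> float:
    """Cosine similarity: the dot product of the length-normalized vectors."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot(a, b) / (norm_a * norm_b)


query = [1.0, 0.0]
aligned_doc = [1.0, 0.0]   # same direction as the query, small norm
large_doc = [3.0, 3.0]     # different direction, large norm

# The dot product prefers the large-norm document ...
print(dot(query, aligned_doc), dot(query, large_doc))
# ... while cosine similarity prefers the better-aligned one.
print(cos_sim(query, aligned_doc), cos_sim(query, large_doc))
```

A model should generally be evaluated with the score function it was trained with.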

Reranking using Cross-Encoder model

from beir.reranking.models import CrossEncoder
from beir.reranking import Rerank

cross_encoder_model = CrossEncoder('cross-encoder/ms-marco-electra-base')
reranker = Rerank(cross_encoder_model, batch_size=128)

# Rerank top-100 results retrieved by BM25
rerank_results = reranker.rerank(corpus, queries, bm25_results, top_k=100)

🍻 Available Models

Name Implementation
BM25 (Robertson and Zaragoza, 2009) https://www.elastic.co/
Anserini (Yang et al., 2017) https://github.com/castorini/anserini
SBERT (Reimers and Gurevych, 2019) https://www.sbert.net/
ANCE (Xiong et al., 2020) https://github.com/microsoft/ANCE
DPR (Karpukhin et al., 2020) https://github.com/facebookresearch/DPR
USE-QA (Yang et al., 2020) https://tfhub.dev/google/universal-sentence-encoder-qa/3
SPARTA (Zhao et al., 2020) https://huggingface.co/BeIR
ColBERT (Khattab and Zaharia, 2020) https://github.com/stanford-futuredata/ColBERT

Disclaimer

If you use any one of the implementations, please make sure to include the correct citation.

If you implemented a model and wish to update any part of it, or do not want the model to be included, feel free to post an issue here or make a pull request!

If you implemented a model and wish to include your model in this library, feel free to post an issue here or make a pull request. Otherwise, if you want to evaluate the model on your own, see the following section.

🍻 Evaluate your own Model

Dense-Retriever Model (Dual-Encoder)

Wrap your dual-encoder model in a class that implements two methods: encode_queries and encode_corpus.

from typing import Dict, List

import numpy as np
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

class YourCustomDEModel:
    def __init__(self, model_path=None, **kwargs):
        self.model = None # ---> HERE Load your custom model
    
    # Write your own encoding query function (Returns: Query embeddings as numpy array)
    def encode_queries(self, queries: List[str], batch_size: int, **kwargs) -> np.ndarray:
        pass
    
    # Write your own encoding corpus function (Returns: Document embeddings as numpy array)  
    def encode_corpus(self, corpus: List[Dict[str, str]], batch_size: int, **kwargs) -> np.ndarray:
        pass

custom_model = DRES(YourCustomDEModel(model_path="your-custom-model-path"))

Re-ranking-based Model (Cross-Encoder)

Wrap your cross-encoder model in a class that implements a single method: predict.

from typing import List, Tuple

from beir.reranking import Rerank

class YourCustomCEModel:
    def __init__(self, model_path=None, **kwargs):
        self.model = None # ---> HERE Load your custom model
    
    # Write your own score function, which takes in query-document text pairs and returns the similarity scores
    def predict(self, sentences: List[Tuple[str,str]], batch_size: int, **kwargs) -> List[float]:
        pass # return only the list of float scores

reranker = Rerank(YourCustomCEModel(model_path="your-custom-model-path"), batch_size=128)
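
To make the predict contract concrete, here is a toy scorer that follows the same interface without loading any model. It is purely illustrative: the class name `TokenOverlapCEModel` and its token-overlap score are assumptions, and a real cross-encoder would run a neural model over each pair instead.

```python
from typing import List, Tuple


class TokenOverlapCEModel:
    """Toy stand-in for a cross-encoder: scores a pair by query-token overlap."""

    def __init__(self, model_path=None, **kwargs):
        self.model = None  # nothing to load for this toy scorer

    def predict(self, sentences: List[Tuple[str, str]], batch_size: int = 32, **kwargs) -> List[float]:
        scores = []
        for query, doc in sentences:
            query_tokens = set(query.lower().split())
            doc_tokens = set(doc.lower().split())
            # Fraction of distinct query tokens that also appear in the document.
            overlap = len(query_tokens & doc_tokens) / len(query_tokens) if query_tokens else 0.0
            scores.append(overlap)
        return scores


toy_model = TokenOverlapCEModel()
pairs = [
    ("wheat beer", "wheat beer is a top-fermented beer"),
    ("wheat beer", "albert einstein was a physicist"),
]
pair_scores = toy_model.predict(pairs, batch_size=2)
print(pair_scores)
```

The key point is the shape of the contract: one float per (query, document) pair, returned in the same order as the input.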

🍻 Available Metrics

We evaluate our models using pytrec_eval; in the future we may extend this to include more retrieval-based metrics:

  • NDCG (NDCG@k)
  • MAP (MAP@k)
  • Recall (Recall@k)
  • Precision (P@k)

We also include custom metrics that can be used for evaluation; please refer to evaluate_custom_metrics.py:

  • MRR (MRR@k)
  • Capped Recall (R_cap@k)
  • Hole (Hole@k): % of top-k docs retrieved unseen by annotators
  • Top-K Accuracy (Accuracy@k): % of relevant docs present in top-k results
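
To make some of the custom metrics concrete, the sketch below computes MRR@k, capped recall and Hole@k on toy data in plain Python. These are common textbook-style definitions written for illustration under stated assumptions (scores above 0 count as relevant, averages are taken over all queries in `results`); they are not BEIR's own implementations, and the function names are hypothetical.

```python
from typing import Dict, List

Qrels = Dict[str, Dict[str, int]]
Results = Dict[str, Dict[str, float]]


def ranked_ids(doc_scores: Dict[str, float], k: int) -> List[str]:
    """Document IDs of the k highest-scoring results."""
    ranked = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


def relevant_ids(qrels: Qrels, query_id: str) -> set:
    return {doc_id for doc_id, score in qrels.get(query_id, {}).items() if score > 0}


def mrr_at_k(qrels: Qrels, results: Results, k: int) -> float:
    """Mean reciprocal rank of the first relevant document within the top k."""
    total = 0.0
    for query_id, doc_scores in results.items():
        relevant = relevant_ids(qrels, query_id)
        for rank, doc_id in enumerate(ranked_ids(doc_scores, k), start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(results)


def capped_recall_at_k(qrels: Qrels, results: Results, k: int) -> float:
    """Recall where the denominator is capped at k, so a perfect top-k scores 1.0."""
    total = 0.0
    for query_id, doc_scores in results.items():
        relevant = relevant_ids(qrels, query_id)
        if not relevant:
            continue
        found = sum(1 for doc_id in ranked_ids(doc_scores, k) if doc_id in relevant)
        total += found / min(k, len(relevant))
    return total / len(results)


def hole_at_k(qrels: Qrels, results: Results, k: int) -> float:
    """Average share of the top-k slots filled by documents with no judgment at all."""
    total = 0.0
    for query_id, doc_scores in results.items():
        judged = set(qrels.get(query_id, {}))
        unjudged = sum(1 for doc_id in ranked_ids(doc_scores, k) if doc_id not in judged)
        total += unjudged / k
    return total / len(results)


toy_qrels = {"q1": {"doc1": 1, "doc4": 0}, "q2": {"doc2": 1}}
toy_results = {
    "q1": {"doc3": 0.9, "doc1": 0.8, "doc4": 0.1},
    "q2": {"doc2": 0.7, "doc5": 0.2},
}

print(mrr_at_k(toy_qrels, toy_results, k=2))
print(capped_recall_at_k(toy_qrels, toy_results, k=2))
print(hole_at_k(toy_qrels, toy_results, k=2))
```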

🍻 Citing & Authors

If you find this repository helpful, feel free to cite our publication BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models:

@inproceedings{
    thakur2021beir,
    title={{BEIR}: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models},
    author={Nandan Thakur and Nils Reimers and Andreas R{\"u}ckl{\'e} and Abhishek Srivastava and Iryna Gurevych},
    booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)},
    year={2021},
    url={https://openreview.net/forum?id=wCu6T5xFjeJ}
}

The main contributors of this repository are:

Contact person: Nandan Thakur, nandant@gmail.com

https://www.ukp.tu-darmstadt.de/

Don't hesitate to send us an e-mail or report an issue, if something is broken (and it shouldn't be) or if you have further questions.

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.