beir: A Python repository from eneasmesquita

🍻 What is it?

BEIR is a heterogeneous benchmark containing diverse IR tasks. It also provides a common and easy framework for evaluation of your NLP-based retrieval models within the benchmark.

For more information, check out our publication:

BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models (NeurIPS 2021, Datasets and Benchmarks Track)

🍻 Installation

Install via pip:

pip install beir

If you want to build from source, use:

$ git clone https://github.com/benchmarkir/beir.git
$ cd beir
$ pip install -e .

Tested with Python versions 3.6 and 3.7.

🍻 Features

Preprocess your own IR dataset or use one of the 17 already-preprocessed benchmark datasets
Covers a wide range of settings, with diverse benchmarks useful for both academia and industry
Includes well-known retrieval architectures (lexical, dense, sparse and reranking-based)
Add and evaluate your own model in an easy framework using different state-of-the-art evaluation metrics

🍻 Leaderboard

The BEIR leaderboards are kept in the Google Sheets linked below, because the tables were not easy to read in Markdown.

Leaderboard	Link
Dense Retrieval	Google Sheet
BM25 top-100 + CE Reranking	Google Sheet

🍻 Course Material on IR

If you are new to Information Retrieval and wish to learn more about classical or neural IR, we suggest looking at the open-source courses below.

Course	University	Instructor	Link	Available
Training SOTA Neural Search Models	Hugging Face	Nils Reimers	Link	Video
BEIR: Benchmarking IR	UKP Lab	Nandan Thakur	Link	Video + Slides
Intro to Advanced IR	TU Wien'21	Sebastian Hofstaetter	Link	Videos + Slides
CS224U NLU + IR	Stanford'21	Omar Khattab	Link	Slides
Pretrained Transformers for Text Ranking: BERT and Beyond	MPI, Waterloo'21	Andrew Yates, Rodrigo Nogueira, Jimmy Lin	Link	PDF
BoF Session on IR	NAACL'21	Sean MacAvaney, Luca Soldaini	Link	Slides

🍻 Examples and Tutorials

To easily understand and get your hands dirty with BEIR, we invite you to try our tutorials out 🚀 🚀

🍻 Google Colab

Name	Link
How to evaluate pre-trained models on BEIR datasets

🍻 Lexical Retrieval (Evaluation)

Name	Link
BM25 Retrieval with Elasticsearch	evaluate_bm25.py
Anserini-BM25 (Pyserini) Retrieval with Docker	evaluate_anserini_bm25.py
Multilingual BM25 Retrieval with Elasticsearch 🆕	evaluate_multilingual_bm25.py

🍻 Dense Retrieval (Evaluation)

Name	Link
Exact-search retrieval using (dense) Sentence-BERT	evaluate_sbert.py
Exact-search retrieval using (dense) ANCE	evaluate_ance.py
Exact-search retrieval using (dense) DPR	evaluate_dpr.py
Exact-search retrieval using (dense) USE-QA	evaluate_useqa.py
ANN and Exact-search using Faiss 🆕	evaluate_faiss_dense.py
Retrieval using Binary Passage Retriever (BPR) 🆕	evaluate_bpr.py
Dimension Reduction using PCA 🆕	evaluate_dim_reduction.py

🍻 Sparse Retrieval (Evaluation)

Name	Link
Hybrid sparse retrieval using SPARTA	evaluate_sparta.py
Sparse retrieval using docT5query and Pyserini	evaluate_anserini_docT5query.py
Sparse retrieval using docT5query (MultiGPU) and Pyserini 🆕	evaluate_anserini_docT5query_parallel.py
Sparse retrieval using DeepCT and Pyserini 🆕	evaluate_deepct.py

🍻 Reranking (Evaluation)

Name	Link
Reranking top-100 BM25 results with SBERT CE	evaluate_bm25_ce_reranking.py
Reranking top-100 BM25 results with Dense Retriever	evaluate_bm25_sbert_reranking.py

🍻 Dense Retrieval (Training)

Name	Link
Train SBERT with Inbatch negatives	train_sbert.py
Train SBERT with BM25 hard negatives	train_sbert_BM25_hardnegs.py
Train MSMARCO SBERT with BM25 Negatives	train_msmarco_v2.py
Train (SOTA) MSMARCO SBERT with Mined Hard Negatives 🆕	train_msmarco_v3.py
Train (SOTA) MSMARCO BPR with Mined Hard Negatives 🆕	train_msmarco_v3_bpr.py
Train (SOTA) MSMARCO SBERT with Mined Hard Negatives (Margin-MSE) 🆕	train_msmarco_v3_margin_MSE.py

🍻 Question Generation

Name	Link
Synthetic Query Generation using T5-model	query_gen.py
(GenQ) Synthetic QG using T5-model + fine-tuning SBERT	query_gen_and_train.py
Synthetic Query Generation using Multiple GPU and T5 🆕	query_gen_multi_gpu.py

🍻 Benchmarking (Evaluation)

Name	Link
Benchmark BM25 (Inference speed)	benchmark_bm25.py
Benchmark Cross-Encoder Reranking (Inference speed)	benchmark_bm25_ce_reranking.py
Benchmark Dense Retriever (Inference speed)	benchmark_sbert.py

🍻 Quick Example

from beir import util, LoggingHandler
from beir.retrieval import models
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

import logging
import pathlib, os

#### Just some code to print debug information to stdout
logging.basicConfig(format='%(asctime)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    level=logging.INFO,
                    handlers=[LoggingHandler()])
#### /print debug information to stdout

#### Download scifact.zip dataset and unzip the dataset
dataset = "scifact"
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
out_dir = os.path.join(pathlib.Path(__file__).parent.absolute(), "datasets")
data_path = util.download_and_unzip(url, out_dir)

#### Provide the data_path where scifact has been downloaded and unzipped
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

#### Load the SBERT model and retrieve using cosine-similarity
model = DRES(models.SentenceBERT("msmarco-distilbert-base-v3"), batch_size=16)
retriever = EvaluateRetrieval(model, score_function="cos_sim") # or "dot" for dot-product
results = retriever.retrieve(corpus, queries)

#### Evaluate your model with NDCG@k, MAP@K, Recall@K and Precision@K  where k = [1,3,5,10,100,1000] 
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)

🍻 Download a preprocessed dataset

To download one of the already preprocessed datasets into your current directory and load it, use the following:

from beir import util
from beir.datasets.data_loader import GenericDataLoader

dataset = "scifact"
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{}.zip".format(dataset)
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

This will download the scifact dataset under the datasets directory.

For other datasets, use one of the dataset names listed below.

🍻 Available Datasets

Command to generate an MD5 hash from the terminal: md5sum filename.zip (on macOS: md5 filename.zip).
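
A downloaded archive can also be checked against the md5 column from Python. The sketch below uses only the standard library; the function names `md5_of_file` and `matches_expected_md5` are illustrative and are not part of BEIR.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large archives are never loaded fully into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected_md5(path, expected_md5):
    """Compare the file's digest with an expected value, ignoring case and whitespace."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

For example, `matches_expected_md5("datasets/scifact.zip", "5f7d1de60b170fc8027bb7898e2efca1")` would compare a local copy of the SciFact archive with the value listed in the table, assuming the zip file was kept at that (hypothetical) path.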

Dataset	Website	BEIR-Name	Type	Queries	Corpus	Rel D/Q	Download	md5
MSMARCO	Homepage	`msmarco`	`train` `dev` `test`	6,980	8.84M	1.1	Link	`444067daf65d982533ea17ebd59501e4`
MSMARCO v2	Homepage	`msmarco-v2`	`train` `dev1` `dev2`	4,552 4,702	138M		Link	`ba6238b403f0b345683885cc9390fff5`
TREC-COVID	Homepage	`trec-covid`	`test`	50	171K	493.5	Link	`ce62140cb23feb9becf6270d0d1fe6d1`
NFCorpus	Homepage	`nfcorpus`	`train` `dev` `test`	323	3.6K	38.2	Link	`a89dba18a62ef92f7d323ec890a0d38d`
BioASQ	Homepage	`bioasq`	`train` `test`	500	14.91M	8.05	No	How to Reproduce?
NQ	Homepage	`nq`	`train` `test`	3,452	2.68M	1.2	Link	`d4d3d2e48787a744b6f6e691ff534307`
HotpotQA	Homepage	`hotpotqa`	`train` `dev` `test`	7,405	5.23M	2.0	Link	`f412724f78b0d91183a0e86805e16114`
FiQA-2018	Homepage	`fiqa`	`train` `dev` `test`	648	57K	2.6	Link	`17918ed23cd04fb15047f73e6c3bd9d9`
Signal-1M(RT)	Homepage	`signal1m`	`test`	97	2.86M	19.6	No	How to Reproduce?
TREC-NEWS	Homepage	`trec-news`	`test`	57	595K	19.6	No	How to Reproduce?
ArguAna	Homepage	`arguana`	`test`	1,406	8.67K	1.0	Link	`8ad3e3c2a5867cdced806d6503f29b99`
Touche-2020	Homepage	`webis-touche2020`	`test`	49	382K	19.0	Link	`46f650ba5a527fc69e0a6521c5a23563`
CQADupstack	Homepage	`cqadupstack`	`test`	13,145	457K	1.4	Link	`4e41456d7df8ee7760a7f866133bda78`
Quora	Homepage	`quora`	`dev` `test`	10,000	523K	1.6	Link	`18fb154900ba42a600f84b839c173167`
DBPedia	Homepage	`dbpedia-entity`	`dev` `test`	400	4.63M	38.2	Link	`c2a39eb420a3164af735795df012ac2c`
SCIDOCS	Homepage	`scidocs`	`test`	1,000	25K	4.9	Link	`38121350fc3a4d2f48850f6aff52e4a9`
FEVER	Homepage	`fever`	`train` `dev` `test`	6,666	5.42M	1.2	Link	`5a818580227bfb4b35bb6fa46d9b6c03`
Climate-FEVER	Homepage	`climate-fever`	`test`	1,535	5.42M	3.0	Link	`8b66f0a9126c521bae2bde127b4dc99d`
SciFact	Homepage	`scifact`	`train` `test`	300	5K	1.1	Link	`5f7d1de60b170fc8027bb7898e2efca1`
Robust04	Homepage	`robust04`	`test`	249	528K	69.9	No	How to Reproduce?

🍻 Multilingual Datasets

Language	Dataset	Website	BEIR-Name	Type	Queries	Corpus	Rel D/Q	Download	md5
German	GermanQuAD	Homepage	`germanquad`	`test`	2,044	2.80M	1.0	Link	`95a581c3162d10915a418609bcce851b`
Arabic	Mr.TyDI	Homepage	`mrtydi/arabic`	`train` `dev` `test`	1,081	2.1M	1.2	Link	`17072d0e1610bd8461d962b8ac560fc5`
Bengali	Mr.TyDI	Homepage	`mrtydi/bengali`	`train` `dev` `test`	111	304K	1.2	Link	`17072d0e1610bd8461d962b8ac560fc5`
Finnish	Mr.TyDI	Homepage	`mrtydi/finnish`	`train` `dev` `test`	1,254	1.9M	1.2	Link	`17072d0e1610bd8461d962b8ac560fc5`
Indonesian	Mr.TyDI	Homepage	`mrtydi/indonesian`	`train` `dev` `test`	829	1.47M	1.2	Link	`17072d0e1610bd8461d962b8ac560fc5`
Japanese	Mr.TyDI	Homepage	`mrtydi/japanese`	`train` `dev` `test`	720	7M	1.3	Link	`17072d0e1610bd8461d962b8ac560fc5`
Korean	Mr.TyDI	Homepage	`mrtydi/korean`	`train` `dev` `test`	421	1.5M	1.2	Link	`17072d0e1610bd8461d962b8ac560fc5`
Russian	Mr.TyDI	Homepage	`mrtydi/russian`	`train` `dev` `test`	995	9.6M	1.2	Link	`17072d0e1610bd8461d962b8ac560fc5`
Swahili	Mr.TyDI	Homepage	`mrtydi/swahili`	`train` `dev` `test`	670	136K	1.1	Link	`17072d0e1610bd8461d962b8ac560fc5`
Telugu	Mr.TyDI	Homepage	`mrtydi/telugu`	`train` `dev` `test`	646	548K	1.0	Link	`17072d0e1610bd8461d962b8ac560fc5`
Thai	Mr.TyDI	Homepage	`mrtydi/thai`	`train` `dev` `test`	1,190	568K	1.1	Link	`17072d0e1610bd8461d962b8ac560fc5`

🍻 Translated (Multilingual) Datasets

Language	Dataset	Website	BEIR-Name	Type	Queries	Corpus	Rel D/Q	Download	md5
Spanish	mMARCO	Homepage	`mmarco/spanish`	`train` `dev`	6,980	8.84M	1.1	Link	`b727dbec65315a76bceaff56ad77d2c7`
French	mMARCO	Homepage	`mmarco/french`	`train` `dev`	6,980	8.84M	1.1	Link	`b727dbec65315a76bceaff56ad77d2c7`
Portuguese	mMARCO	Homepage	`mmarco/portuguese`	`train` `dev`	6,980	8.84M	1.1	Link	`b727dbec65315a76bceaff56ad77d2c7`
Italian	mMARCO	Homepage	`mmarco/italian`	`train` `dev`	6,980	8.84M	1.1	Link	`b727dbec65315a76bceaff56ad77d2c7`
Indonesian	mMARCO	Homepage	`mmarco/indonesian`	`train` `dev`	6,980	8.84M	1.1	Link	`b727dbec65315a76bceaff56ad77d2c7`
German	mMARCO	Homepage	`mmarco/german`	`train` `dev`	6,980	8.84M	1.1	Link	`b727dbec65315a76bceaff56ad77d2c7`
Russian	mMARCO	Homepage	`mmarco/russian`	`train` `dev`	6,980	8.84M	1.1	Link	`b727dbec65315a76bceaff56ad77d2c7`
Chinese	mMARCO	Homepage	`mmarco/chinese`	`train` `dev`	6,980	8.84M	1.1	Link	`b727dbec65315a76bceaff56ad77d2c7`

Otherwise, you can load a custom preprocessed dataset in the following way:

from beir.datasets.data_loader import GenericDataLoader

corpus_path = "your_corpus_file.jsonl"
query_path = "your_query_file.jsonl"
qrels_path = "your_qrels_file.tsv"

corpus, queries, qrels = GenericDataLoader(
    corpus_file=corpus_path, 
    query_file=query_path, 
    qrels_file=qrels_path).load_custom()

Make sure that the dataset is in the following format:

corpus file: a .jsonl file (JSON Lines) with one dictionary per line, each with three fields: _id (unique document identifier), title (document title, optional) and text (document paragraph or passage). For example: {"_id": "doc1", "title": "Albert Einstein", "text": "Albert Einstein was a German-born...."}
queries file: a .jsonl file (JSON Lines) with one dictionary per line, each with two fields: _id (unique query identifier) and text (query text). For example: {"_id": "q1", "text": "Who developed the mass-energy equivalence formula?"}
qrels file: a .tsv file (tab-separated) with three columns in this order: query-id, corpus-id and score. Keep the first row as a header. For example: q1	doc1	1
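
As a minimal sketch of the format described above, the helper below writes in-memory dictionaries to the three files using only the standard library. The function name and the file names `corpus.jsonl`, `queries.jsonl` and `qrels.tsv` are illustrative choices, not requirements stated here; the header row reuses the column names from the qrels description.

```python
import csv
import json
import os


def write_beir_style_dataset(out_dir, corpus, queries, qrels):
    """Write corpus/queries as JSON Lines and qrels as a tab-separated file with a header."""
    os.makedirs(out_dir, exist_ok=True)
    corpus_path = os.path.join(out_dir, "corpus.jsonl")
    query_path = os.path.join(out_dir, "queries.jsonl")
    qrels_path = os.path.join(out_dir, "qrels.tsv")

    with open(corpus_path, "w", encoding="utf-8") as f:
        for doc_id, doc in corpus.items():
            # Title is optional, so fall back to an empty string.
            record = {"_id": doc_id, "title": doc.get("title", ""), "text": doc["text"]}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

    with open(query_path, "w", encoding="utf-8") as f:
        for query_id, text in queries.items():
            f.write(json.dumps({"_id": query_id, "text": text}, ensure_ascii=False) + "\n")

    with open(qrels_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f, delimiter="\t", lineterminator="\n")
        writer.writerow(["query-id", "corpus-id", "score"])  # first row is the header
        for query_id, judged in qrels.items():
            for doc_id, score in judged.items():
                writer.writerow([query_id, doc_id, score])

    return corpus_path, query_path, qrels_path
```

The three returned paths can then be passed as `corpus_file`, `query_file` and `qrels_file` to the loader call shown earlier.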

You can also skip the dataset loading step and provide corpus, queries and qrels directly, in the following way:

corpus = {
    "doc1" : {
        "title": "Albert Einstein", 
        "text": "Albert Einstein was a German-born theoretical physicist who developed the theory of relativity, \
                 one of the two pillars of modern physics (alongside quantum mechanics). His work is also known for \
                 its influence on the philosophy of science. He is best known to the general public for his mass–energy \
                 equivalence formula E = mc2, which has been dubbed 'the world's most famous equation'. He received the 1921 \
                 Nobel Prize in Physics 'for his services to theoretical physics, and especially for his discovery of the law \
                 of the photoelectric effect', a pivotal step in the development of quantum theory."
        },
    "doc2" : {
        "title": "", # Keep title an empty string if not present
        "text": "Wheat beer is a top-fermented beer which is brewed with a large proportion of wheat relative to the amount of \
                 malted barley. The two main varieties are German Weißbier and Belgian witbier; other types include Lambic (made\
                 with wild yeast), Berliner Weisse (a cloudy, sour beer), and Gose (a sour, salty beer)."
    },
}

queries = {
    "q1" : "Who developed the mass-energy equivalence formula?",
    "q2" : "Which beer is brewed with a large proportion of wheat?"
}

qrels = {
    "q1" : {"doc1": 1},
    "q2" : {"doc2": 1},
}
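
When the dictionaries are built by hand as above, it can help to check that every id in qrels actually exists in queries and corpus before evaluating. The following is a small sketch of such a check; `find_qrels_problems` is a hypothetical helper, not a BEIR function, and the data is a shortened version of the example above.

```python
def find_qrels_problems(corpus, queries, qrels):
    """Return human-readable messages for ids or scores in qrels that look inconsistent."""
    problems = []
    for query_id, judged in qrels.items():
        if query_id not in queries:
            problems.append(f"query id {query_id!r} in qrels is missing from queries")
        for doc_id, score in judged.items():
            if doc_id not in corpus:
                problems.append(f"doc id {doc_id!r} in qrels is missing from corpus")
            if not isinstance(score, int):
                problems.append(f"score for ({query_id!r}, {doc_id!r}) is not an integer")
    return problems


corpus = {
    "doc1": {"title": "Albert Einstein", "text": "Albert Einstein was a German-born theoretical physicist."},
    "doc2": {"title": "", "text": "Wheat beer is a top-fermented beer."},
}
queries = {
    "q1": "Who developed the mass-energy equivalence formula?",
    "q2": "Which beer is brewed with a large proportion of wheat?",
}
qrels = {"q1": {"doc1": 1}, "q2": {"doc2": 1}}

problems = find_qrels_problems(corpus, queries, qrels)
print(problems or "qrels are consistent with corpus and queries")
```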

Disclaimer

Similar to Tensorflow datasets or HuggingFace's datasets library, we just downloaded and prepared public datasets. We only distribute these datasets in a specific format, but we do not vouch for their quality or fairness, or claim that you have license to use the dataset. It remains the user's responsibility to determine whether you as a user have permission to use the dataset under the dataset's license and to cite the right owner of the dataset.

If you're a dataset owner and wish to update any part of it, or do not want your dataset to be included in this library, feel free to post an issue here or make a pull request!

If you're a dataset owner and wish to include your dataset or model in this library, feel free to post an issue here or make a pull request!

🍻 Evaluate a model

We include different retrieval architectures and evaluate them all in a zero-shot setup.

Lexical Retrieval Evaluation using BM25 (Elasticsearch)

from beir.retrieval.search.lexical import BM25Search as BM25

hostname = "your-hostname" #localhost
index_name = "your-index-name" # scifact
initialize = True # True, will delete existing index with same name and reindex all documents
model = BM25(index_name=index_name, hostname=hostname, initialize=initialize)

Sparse Retrieval using SPARTA

from beir.retrieval.search.sparse import SparseSearch
from beir.retrieval import models

model_path = "BeIR/sparta-msmarco-distilbert-base-v1"
sparse_model = SparseSearch(models.SPARTA(model_path), batch_size=128)

Dense Retrieval using SBERT, ANCE, USE-QA or DPR

from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

model = DRES(models.SentenceBERT("msmarco-distilbert-base-v3"), batch_size=16)
retriever = EvaluateRetrieval(model, score_function="cos_sim") # or "dot" for dot-product
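
The two `score_function` options can rank the same documents differently, because the dot product rewards vector length while cosine similarity only looks at direction. The toy example below illustrates this with hand-written 2-d vectors in plain Python; the vectors and names are made up for illustration and have nothing to do with real model embeddings.

```python
import math


def dot_score(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cos_sim_score(a, b):
    """Dot product of the two vectors after normalising each to unit length."""
    norm = math.sqrt(dot_score(a, a)) * math.sqrt(dot_score(b, b))
    return dot_score(a, b) / norm if norm else 0.0


query = [1.0, 0.0]
docs = {
    "short_doc": [1.0, 0.0],  # same direction as the query, small norm
    "long_doc": [3.0, 3.0],   # different direction, large norm
}

best_by_dot = max(docs, key=lambda name: dot_score(query, docs[name]))
best_by_cos = max(docs, key=lambda name: cos_sim_score(query, docs[name]))
print(best_by_dot, best_by_cos)
```

Here the dot product prefers `long_doc` (score 3 versus 1), while cosine similarity prefers `short_doc` (1.0 versus roughly 0.71), so the choice should match how the model was trained.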

Reranking using Cross-Encoder model

from beir.reranking.models import CrossEncoder
from beir.reranking import Rerank

cross_encoder_model = CrossEncoder('cross-encoder/ms-marco-electra-base')
reranker = Rerank(cross_encoder_model, batch_size=128)

# Rerank top-100 results retrieved by BM25
rerank_results = reranker.rerank(corpus, queries, bm25_results, top_k=100)
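
To make the `top_k=100` cut concrete, here is a sketch of trimming first-stage results to the best k documents per query. It assumes the results are a dictionary mapping each query id to a `{doc_id: score}` dictionary, which is an assumption about the shape of `bm25_results` and not something stated above; `keep_top_k` is a hypothetical helper.

```python
def keep_top_k(results, top_k):
    """Keep only the top_k highest-scoring documents for every query."""
    trimmed = {}
    for query_id, doc_scores in results.items():
        ranked = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)
        trimmed[query_id] = dict(ranked[:top_k])
    return trimmed


first_stage = {
    "q1": {"doc1": 12.5, "doc2": 3.1, "doc3": 7.8},
    "q2": {"doc2": 9.0},
}
candidates = keep_top_k(first_stage, top_k=2)
print(candidates)
```

Only the documents that survive this cut would be scored by the cross-encoder, which is why the cost of reranking grows with `top_k` and not with corpus size.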

🍻 Available Models

Name	Implementation
BM25 (Robertson and Zaragoza, 2009)	https://www.elastic.co/
Anserini (Yang et al., 2017)	https://github.com/castorini/anserini
SBERT (Reimers and Gurevych, 2019)	https://www.sbert.net/
ANCE (Xiong et al., 2020)	https://github.com/microsoft/ANCE
DPR (Karpukhin et al., 2020)	https://github.com/facebookresearch/DPR
USE-QA (Yang et al., 2020)	https://tfhub.dev/google/universal-sentence-encoder-qa/3
SPARTA (Zhao et al., 2020)	https://huggingface.co/BeIR
ColBERT (Khattab and Zaharia, 2020)	https://github.com/stanford-futuredata/ColBERT

Disclaimer

If you use any one of the implementations, please make sure to include the correct citation.

If you implemented a model and wish to update any part of it, or do not want the model to be included, feel free to post an issue here or make a pull request!

If you implemented a model and wish to include your model in this library, feel free to post an issue here or make a pull request. Otherwise, if you want to evaluate the model on your own, see the following section.

🍻 Evaluate your own Model

Dense-Retriever Model (Dual-Encoder)

Wrap your dual-encoder model in a class that defines two methods: 1. encode_queries and 2. encode_corpus.

from typing import Dict, List

import numpy as np
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

class YourCustomDEModel:
    def __init__(self, model_path=None, **kwargs):
        self.model = None # ---> HERE Load your custom model
    
    # Write your own encoding query function (Returns: Query embeddings as numpy array)
    def encode_queries(self, queries: List[str], batch_size: int, **kwargs) -> np.ndarray:
        pass
    
    # Write your own encoding corpus function (Returns: Document embeddings as numpy array)
    def encode_corpus(self, corpus: List[Dict[str, str]], batch_size: int, **kwargs) -> np.ndarray:
        pass

custom_model = DRES(YourCustomDEModel(model_path="your-custom-model-path"))
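
To show what a filled-in class might look like, here is a toy stand-in that follows the same two-method interface. It is only a sketch: instead of a neural model it hashes tokens into a fixed-size count vector, and the class name `HashingBagOfWordsModel` and its `dim` parameter are invented for illustration.

```python
import zlib
from typing import Dict, List

import numpy as np


class HashingBagOfWordsModel:
    """Toy dual encoder: hashes lowercase tokens into a fixed-size count vector."""

    def __init__(self, dim: int = 64, **kwargs):
        self.dim = dim

    def _embed(self, text: str) -> np.ndarray:
        vector = np.zeros(self.dim, dtype=np.float32)
        for token in text.lower().split():
            # crc32 is deterministic across runs, unlike Python's built-in hash().
            vector[zlib.crc32(token.encode("utf-8")) % self.dim] += 1.0
        return vector

    def encode_queries(self, queries: List[str], batch_size: int = 16, **kwargs) -> np.ndarray:
        # One row per query; batch_size is accepted but unused in this toy version.
        return np.stack([self._embed(query) for query in queries])

    def encode_corpus(self, corpus: List[Dict[str, str]], batch_size: int = 16, **kwargs) -> np.ndarray:
        # Concatenate the optional title with the text before embedding.
        texts = [(doc.get("title", "") + " " + doc["text"]).strip() for doc in corpus]
        return np.stack([self._embed(text) for text in texts])


model = HashingBagOfWordsModel(dim=32)
query_embeddings = model.encode_queries(["wheat beer"])
doc_embeddings = model.encode_corpus([
    {"title": "", "text": "wheat beer"},
    {"title": "Albert Einstein", "text": "physicist"},
])
print(query_embeddings.shape, doc_embeddings.shape)
```

Both methods return a 2-d array with one row per input, which is the shape an exact-search wrapper needs to compute a query-by-document score matrix.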

Re-ranking-based Model (Cross-Encoder)

Wrap your cross-encoder model in a class that defines a single method: predict.

from typing import List, Tuple

from beir.reranking import Rerank

class YourCustomCEModel:
    def __init__(self, model_path=None, **kwargs):
        self.model = None # ---> HERE Load your custom model
    
    # Write your own score function, which takes in query-document text pairs and returns the similarity scores
    def predict(self, sentences: List[Tuple[str, str]], batch_size: int, **kwargs) -> List[float]:
        pass # return only the list of float scores

reranker = Rerank(YourCustomCEModel(model_path="your-custom-model-path"), batch_size=128)
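
As an illustration of the `predict` contract, the toy scorer below returns one float per query-document pair. It is a sketch, not a real cross-encoder: the score is simply the fraction of query tokens that also appear in the document, and the class name `TokenOverlapScorer` is invented.

```python
from typing import List, Tuple


class TokenOverlapScorer:
    """Toy stand-in for a cross-encoder: scores a pair by query-token overlap."""

    def __init__(self, **kwargs):
        pass  # a real model would be loaded here

    def predict(self, sentences: List[Tuple[str, str]], batch_size: int = 32, **kwargs) -> List[float]:
        scores = []
        for query, document in sentences:
            query_tokens = set(query.lower().split())
            doc_tokens = set(document.lower().split())
            overlap = len(query_tokens & doc_tokens)
            # Fraction of query tokens found in the document; 0.0 for an empty query.
            scores.append(overlap / len(query_tokens) if query_tokens else 0.0)
        return scores


scorer = TokenOverlapScorer()
pair_scores = scorer.predict([
    ("wheat beer", "wheat beer is brewed with wheat"),
    ("wheat beer", "albert einstein was a physicist"),
])
print(pair_scores)
```

The output list has the same length and order as the input pairs, which is what lets a reranker map scores back to documents.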

🍻 Available Metrics

We evaluate our models using pytrec_eval, and in the future we may extend this to include more retrieval-based metrics:

NDCG (NDCG@k)
MAP (MAP@k)
Recall (Recall@k)
Precision (P@k)

We also include custom metrics that can be used for evaluation; see evaluate_custom_metrics.py:

MRR (MRR@k)
Capped Recall (R_cap@k)
Hole (Hole@k): % of top-k docs retrieved unseen by annotators
Top-K Accuracy (Accuracy@k): % of relevant docs present in top-k results
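
To make two of the custom metrics concrete, here is a plain-Python sketch of MRR@k and Hole@k. It is not the repository's implementation: it assumes qrels and results are nested dictionaries keyed by query id and then document id, treats any positive score in qrels as relevant, and divides the hole count by k for every query.

```python
def top_k_docs(doc_scores, k):
    """Return the ids of the k highest-scoring documents, best first."""
    ranked = sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]


def mrr_at_k(qrels, results, k):
    """Mean over queries of 1/rank of the first relevant document in the top k."""
    total = 0.0
    for query_id, judged in qrels.items():
        relevant = {doc_id for doc_id, score in judged.items() if score > 0}
        for rank, doc_id in enumerate(top_k_docs(results.get(query_id, {}), k), start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(qrels) if qrels else 0.0


def hole_at_k(qrels, results, k):
    """Mean over queries of the share of top-k documents that have no judgement at all."""
    total = 0.0
    for query_id, doc_scores in results.items():
        judged = qrels.get(query_id, {})
        unjudged = sum(1 for doc_id in top_k_docs(doc_scores, k) if doc_id not in judged)
        total += unjudged / k
    return total / len(results) if results else 0.0


qrels = {"q1": {"d1": 1}, "q2": {"d3": 1, "d4": 0}}
results = {"q1": {"d2": 0.9, "d1": 0.8}, "q2": {"d3": 0.7, "d5": 0.2}}
print(mrr_at_k(qrels, results, k=2), hole_at_k(qrels, results, k=2))
```

In this toy data the first relevant document sits at rank 2 for `q1` and rank 1 for `q2`, and each query has one unjudged document in its top 2.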

🍻 Citing & Authors

If you find this repository helpful, feel free to cite our publication BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models:

@inproceedings{
    thakur2021beir,
    title={{BEIR}: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models},
    author={Nandan Thakur and Nils Reimers and Andreas R{\"u}ckl{\'e} and Abhishek Srivastava and Iryna Gurevych},
    booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)},
    year={2021},
    url={https://openreview.net/forum?id=wCu6T5xFjeJ}
}

The main contributors of this repository are:

Nandan Thakur, Personal Website: nandan-thakur.com

Contact person: Nandan Thakur, nandant@gmail.com

https://www.ukp.tu-darmstadt.de/

Don't hesitate to send us an e-mail or report an issue, if something is broken (and it shouldn't be) or if you have further questions.

This repository contains experimental software and is published for the sole purpose of giving additional background details on the respective publication.
