Massive Text Embedding Benchmark

Installation

pip install mteb

Usage

  • Using a Python script:
from mteb import MTEB
from sentence_transformers import SentenceTransformer

# Define the sentence-transformers model name
model_name = "average_word_embeddings_komninos"

model = SentenceTransformer(model_name)
evaluation = MTEB(tasks=["Banking77Classification"])
results = evaluation.run(model, output_folder=f"results/{model_name}")
  • Using the CLI:
mteb --available_tasks

mteb -m average_word_embeddings_komninos \
    -t Banking77Classification  \
    --output_folder results/average_word_embeddings_komninos \
    --verbosity 3

Advanced usage

Tasks selection

Tasks can be selected by providing the list of tasks that need to be run, but also (the selectors can be combined, as sketched after this list):

  • by their types (e.g. "Clustering" or "Classification")
evaluation = MTEB(task_types=['Clustering', 'Retrieval']) # Only select clustering and retrieval tasks
  • by their categories e.g. "S2S" (sentence to sentence) or "P2P" (paragraph to paragraph)
evaluation = MTEB(task_categories=['S2S']) # Only select sentence2sentence tasks
  • by their languages
evaluation = MTEB(task_langs=["en", "de"]) # Only select tasks which support "en", "de" or "en-de"
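
If needed, these selectors can be combined in a single constructor call. A minimal sketch, assuming the filters are applied jointly:

from mteb import MTEB

# Keep only clustering tasks that support English or German (assumes the filters combine)
evaluation = MTEB(task_types=["Clustering"], task_langs=["en", "de"])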

You can also specify which languages to load for multilingual/crosslingual tasks like this:

from mteb.tasks import AmazonReviewsClassification, BUCCBitextMining

evaluation = MTEB(tasks=[
        AmazonReviewsClassification(langs=["en", "fr"]), # Only load "en" and "fr" subsets of Amazon Reviews
        BUCCBitextMining(langs=["de-en"]), # Only load "de-en" subset of BUCC
])

Evaluation split

We can choose to evaluate only on test splits of all tasks by doing the following:

evaluation.run(model, eval_splits=["test"])

Using a custom model

Models should implement the following interface: an encode function that takes a list of sentences as input and returns a list of embeddings (embeddings can be np.array, torch.tensor, etc.). For inspiration, you can look at the mtebscripts repo, which was used to run diverse models via SLURM scripts for the paper.

class MyModel():
    def encode(self, sentences, batch_size=32, **kwargs):
        """ Returns a list of embeddings for the given sentences.
        Args:
            sentences (`List[str]`): List of sentences to encode
            batch_size (`int`): Batch size for the encoding

        Returns:
            `List[np.ndarray]` or `List[tensor]`: List of embeddings for the given sentences
        """
        pass

model = MyModel()
evaluation = MTEB(tasks=["Banking77Classification"])
evaluation.run(model)
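
As a concrete illustration, a minimal wrapper around a sentence-transformers model that satisfies this interface might look like the sketch below (the wrapper class name and output folder are illustrative, not part of the library):

from mteb import MTEB
from sentence_transformers import SentenceTransformer

class SentenceTransformerWrapper:
    """Illustrative wrapper: delegates encoding to a sentence-transformers model."""

    def __init__(self, model_name):
        self.model = SentenceTransformer(model_name)

    def encode(self, sentences, batch_size=32, **kwargs):
        # Encode in batches and return one embedding (np.ndarray) per input sentence
        embeddings = self.model.encode(sentences, batch_size=batch_size)
        return list(embeddings)

model = SentenceTransformerWrapper("average_word_embeddings_komninos")
evaluation = MTEB(tasks=["Banking77Classification"])
evaluation.run(model, output_folder="results/sentence_transformer_wrapper")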

If you'd like to use different encoding functions for queries and for the corpus when evaluating a Dense Retrieval Exact Search (DRES) model on retrieval tasks from BeIR, you can make your model DRES-compatible, as in the example below. DRES-compatible models are used for the BeIR tasks during evaluation.

from mteb import AbsTaskRetrieval, DRESModel

class MyModel(DRESModel):
    # Refer to the code of DRESModel for the methods to overwrite
    pass

assert AbsTaskRetrieval.is_dres_compatible(MyModel)
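
A minimal sketch of such a model, assuming the DRES interface expects separate encode_queries and encode_corpus methods and BeIR-style corpus entries (dictionaries with "title" and "text" keys); the checkpoint name is only an example:

from mteb import DRESModel
from sentence_transformers import SentenceTransformer

class MyDresModel(DRESModel):
    def __init__(self):
        # Example checkpoint; separate query/passage encoders could be used instead
        self.model = SentenceTransformer("msmarco-distilbert-base-v3")

    def encode_queries(self, queries, batch_size=32, **kwargs):
        # Encode the query strings
        return self.model.encode(queries, batch_size=batch_size)

    def encode_corpus(self, corpus, batch_size=32, **kwargs):
        # Assumes a BeIR-style corpus: a list of dicts with "title" and "text" fields
        sentences = [(doc.get("title", "") + " " + doc["text"]).strip() for doc in corpus]
        return self.model.encode(sentences, batch_size=batch_size)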

Evaluating on a custom task

To add a new task, you need to implement a new class that inherits from the AbsTask associated with the task type (e.g. AbsTaskReranking for reranking tasks). You can find the supported task types here.

from mteb import MTEB
from mteb.abstasks.AbsTaskReranking import AbsTaskReranking
from sentence_transformers import SentenceTransformer


class MindSmallReranking(AbsTaskReranking):
    @property
    def description(self):
        return {
            "name": "MindSmallReranking",
            "hf_hub_name": "mteb/mind_small",
            "description": "Microsoft News Dataset: A Large-Scale English Dataset for News Recommendation Research",
            "reference": "https://www.microsoft.com/en-us/research/uploads/prod/2019/03/nl4se18LinkSO.pdf",
            "type": "Reranking",
            "category": "s2s",
            "eval_splits": ["validation"],
            "eval_langs": ["en"],
            "main_score": "map",
        }

model = SentenceTransformer("average_word_embeddings_komninos")
evaluation = MTEB(tasks=[MindSmallReranking()])
evaluation.run(model)

Note: for multilingual tasks, make sure your class also inherits from the MultilingualTask class, as in the sketch below.
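
A minimal sketch of such a class, assuming MultilingualTask can be imported from mteb.abstasks and is listed before the task-type base class; the dataset name, Hub path, and reference are purely hypothetical:

from mteb.abstasks import MultilingualTask
from mteb.abstasks.AbsTaskClassification import AbsTaskClassification

class MyMultilingualClassification(MultilingualTask, AbsTaskClassification):
    @property
    def description(self):
        return {
            "name": "MyMultilingualClassification",           # hypothetical task name
            "hf_hub_name": "my-org/my-multilingual-dataset",  # hypothetical dataset on the Hub
            "description": "An example multilingual classification task.",
            "reference": "https://example.com",
            "type": "Classification",
            "category": "s2s",
            "eval_splits": ["test"],
            "eval_langs": ["en", "de", "fr"],  # language subsets to load
            "main_score": "accuracy",
        }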

Leaderboard

The MTEB Leaderboard is available here. To submit:

  1. Run your model on MTEB
  2. Format the json result files into model-card metadata using the mteb_meta.py script. For example, python scripts/mteb_meta.py path_to_results_folder will create a mteb_metadata.md file. If you ran CQADupstack retrieval, make sure to merge the results first with python scripts/merge_cqadupstack.py path_to_results_folder.
  3. Copy the content of the mteb_metadata.md file to the top of a README.md file of your model on the Hub. See here for an example.
  4. Refresh the leaderboard and you should see your scores 🥇

Available tasks

Name Hub URL Description Type Category #Languages #Samples (train/dev/test) Avg. chars (train/dev/test)
BUCC mteb/bucc-bitext-mining BUCC bitext mining dataset BitextMining s2s 4 0 0 641684 0 0 101.3
Tatoeba mteb/tatoeba-bitext-mining 1,000 English-aligned sentence pairs for each language based on the Tatoeba corpus BitextMining s2s 112 0 0 2000 0 0 39.4
AmazonCounterfactualClassification mteb/amazon_counterfactual A collection of Amazon customer reviews annotated for counterfactual detection pair classification. Classification s2s 4 4018 335 670 107.3 109.2 106.1
AmazonPolarityClassification mteb/amazon_polarity Amazon Polarity Classification Dataset. Classification s2s 1 3600000 0 400000 431.6 0 431.4
AmazonReviewsClassification mteb/amazon_reviews_multi A collection of Amazon reviews specifically designed to aid research in multilingual text classification. Classification s2s 6 1200000 30000 30000 160.5 159.2 160.4
Banking77Classification mteb/banking77 Dataset composed of online banking queries annotated with their corresponding intents. Classification s2s 1 10003 0 3080 59.5 0 54.2
EmotionClassification mteb/emotion Emotion is a dataset of English Twitter messages with six basic emotions: anger, fear, joy, love, sadness, and surprise. For more detailed information please refer to the paper. Classification s2s 1 16000 2000 2000 96.8 95.3 96.6
ImdbClassification mteb/imdb Large Movie Review Dataset Classification p2p 1 25000 0 25000 1325.1 0 1293.8
MassiveIntentClassification mteb/amazon_massive_intent MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages Classification s2s 51 11514 2033 2974 35.0 34.8 34.6
MassiveScenarioClassification mteb/amazon_massive_scenario MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages Classification s2s 51 11514 2033 2974 35.0 34.8 34.6
MTOPDomainClassification mteb/mtop_domain MTOP: Multilingual Task-Oriented Semantic Parsing Classification s2s 6 15667 2235 4386 36.6 36.5 36.8
MTOPIntentClassification mteb/mtop_intent MTOP: Multilingual Task-Oriented Semantic Parsing Classification s2s 6 15667 2235 4386 36.6 36.5 36.8
ToxicConversationsClassification mteb/toxic_conversations_50k Collection of comments from the Civil Comments platform together with annotations if the comment is toxic or not. Classification s2s 1 50000 0 50000 298.8 0 296.6
TweetSentimentExtractionClassification mteb/tweet_sentiment_extraction Classification s2s 1 27481 0 3534 68.3 0 67.8
ArxivClusteringP2P mteb/arxiv-clustering-p2p Clustering of titles+abstract from arxiv. Clustering of 30 sets, either on the main or secondary category Clustering p2p 1 0 0 732723 0 0 1009.9
ArxivClusteringS2S mteb/arxiv-clustering-s2s Clustering of titles from arxiv. Clustering of 30 sets, either on the main or secondary category Clustering s2s 1 0 0 732723 0 0 74.0
BiorxivClusteringP2P mteb/biorxiv-clustering-p2p Clustering of titles+abstract from biorxiv. Clustering of 10 sets, based on the main category. Clustering p2p 1 0 0 75000 0 0 1666.2
BiorxivClusteringS2S mteb/biorxiv-clustering-s2s Clustering of titles from biorxiv. Clustering of 10 sets, based on the main category. Clustering s2s 1 0 0 75000 0 0 101.6
MedrxivClusteringP2P mteb/medrxiv-clustering-p2p Clustering of titles+abstract from medrxiv. Clustering of 10 sets, based on the main category. Clustering p2p 1 0 0 37500 0 0 1981.2
MedrxivClusteringS2S mteb/medrxiv-clustering-s2s Clustering of titles from medrxiv. Clustering of 10 sets, based on the main category. Clustering s2s 1 0 0 37500 0 0 114.7
RedditClustering mteb/reddit-clustering Clustering of titles from 199 subreddits. Clustering of 25 sets, each with 10-50 classes, and each class with 100 - 1000 sentences. Clustering s2s 1 0 0 420464 0 0 64.7
RedditClusteringP2P mteb/reddit-clustering-p2p Clustering of title+posts from reddit. Clustering of 10 sets of 50k paragraphs and 40 sets of 10k paragraphs. Clustering p2p 1 0 0 459399 0 0 727.7
StackExchangeClustering mteb/stackexchange-clustering Clustering of titles from 121 stackexchanges. Clustering of 25 sets, each with 10-50 classes, and each class with 100 - 1000 sentences. Clustering s2s 1 0 417060 373850 0 56.8 57.0
StackExchangeClusteringP2P mteb/stackexchange-clustering-p2p Clustering of title+body from stackexchange. Clustering of 5 sets of 10k paragraphs and 5 sets of 5k paragraphs. Clustering p2p 1 0 0 75000 0 0 1090.7
TwentyNewsgroupsClustering mteb/twentynewsgroups-clustering Clustering of the 20 Newsgroups dataset (subject only). Clustering s2s 1 0 0 59545 0 0 32.0
SprintDuplicateQuestions mteb/sprintduplicatequestions-pairclassification Duplicate questions from the Sprint community. PairClassification s2s 1 0 101000 101000 0 65.2 67.9
TwitterSemEval2015 mteb/twittersemeval2015-pairclassification Paraphrase-Pairs of Tweets from the SemEval 2015 workshop. PairClassification s2s 1 0 0 16777 0 0 38.3
TwitterURLCorpus mteb/twitterurlcorpus-pairclassification Paraphrase-Pairs of Tweets. PairClassification s2s 1 0 0 51534 0 0 79.5
AskUbuntuDupQuestions mteb/askubuntudupquestions-reranking AskUbuntu Question Dataset - Questions from AskUbuntu with manual annotations marking pairs of questions as similar or non-similar Reranking s2s 1 0 0 2255 0 0 52.5
MindSmallReranking mteb/mind_small Microsoft News Dataset: A Large-Scale English Dataset for News Recommendation Research Reranking s2s 1 231530 0 107968 69.0 0 70.9
SciDocsRR mteb/scidocs-reranking Ranking of related scientific papers based on their title. Reranking s2s 1 0 19594 19599 0 69.4 69.0
StackOverflowDupQuestions mteb/stackoverflowdupquestions-reranking Stack Overflow Duplicate Questions Task for questions with the tags Java, JavaScript and Python Reranking s2s 1 23018 0 3467 49.6 0 49.8
ArguAna BeIR/arguana ArguAna: retrieval of the best counterargument to a given argument Retrieval p2p 1 0 0 10080 0 0 1052.9
ClimateFEVER BeIR/climate-fever CLIMATE-FEVER is a dataset adopting the FEVER methodology that consists of 1,535 real-world claims regarding climate-change. Retrieval s2p 1 0 0 5418128 0 0 539.1
CQADupstackAndroidRetrieval BeIR/cqadupstack/android CQADupStack: A Benchmark Data Set for Community Question-Answering Research Retrieval s2p 1 0 0 23697 0 0 578.7
CQADupstackEnglishRetrieval BeIR/cqadupstack/english CQADupStack: A Benchmark Data Set for Community Question-Answering Research Retrieval s2p 1 0 0 41791 0 0 467.1
CQADupstackGamingRetrieval BeIR/cqadupstack/gaming CQADupStack: A Benchmark Data Set for Community Question-Answering Research Retrieval s2p 1 0 0 46896 0 0 474.7
CQADupstackGisRetrieval BeIR/cqadupstack/gis CQADupStack: A Benchmark Data Set for Community Question-Answering Research Retrieval s2p 1 0 0 38522 0 0 991.1
CQADupstackMathematicaRetrieval BeIR/cqadupstack/mathematica CQADupStack: A Benchmark Data Set for Community Question-Answering Research Retrieval s2p 1 0 0 17509 0 0 1103.7
CQADupstackPhysicsRetrieval BeIR/cqadupstack/physics CQADupStack: A Benchmark Data Set for Community Question-Answering Research Retrieval s2p 1 0 0 39355 0 0 799.4
CQADupstackProgrammersRetrieval BeIR/cqadupstack/programmers CQADupStack: A Benchmark Data Set for Community Question-Answering Research Retrieval s2p 1 0 0 33052 0 0 1030.2
CQADupstackStatsRetrieval BeIR/cqadupstack/stats CQADupStack: A Benchmark Data Set for Community Question-Answering Research Retrieval s2p 1 0 0 42921 0 0 1041.0
CQADupstackTexRetrieval BeIR/cqadupstack/tex CQADupStack: A Benchmark Data Set for Community Question-Answering Research Retrieval s2p 1 0 0 71090 0 0 1246.9
CQADupstackUnixRetrieval BeIR/cqadupstack/unix CQADupStack: A Benchmark Data Set for Community Question-Answering Research Retrieval s2p 1 0 0 48454 0 0 984.7
CQADupstackWebmastersRetrieval BeIR/cqadupstack/webmasters CQADupStack: A Benchmark Data Set for Community Question-Answering Research Retrieval s2p 1 0 0 17911 0 0 689.8
CQADupstackWordpressRetrieval BeIR/cqadupstack/wordpress CQADupStack: A Benchmark Data Set for Community Question-Answering Research Retrieval s2p 1 0 0 49146 0 0 1111.9
DBPedia BeIR/dbpedia-entity DBpedia-Entity is a standard test collection for entity search over the DBpedia knowledge base Retrieval s2p 1 0 4635989 4636322 0 310.2 310.1
FEVER BeIR/fever FEVER (Fact Extraction and VERification) consists of 185,445 claims generated by altering sentences extracted from Wikipedia and subsequently verified without knowledge of the sentence they were derived from. Retrieval s2p 1 0 0 5423234 0 0 538.6
FiQA2018 BeIR/fiqa Financial Opinion Mining and Question Answering Retrieval s2p 1 0 0 58286 0 0 760.4
HotpotQA BeIR/hotpotqa HotpotQA is a question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems. Retrieval s2p 1 0 0 5240734 0 0 288.6
MSMARCO BeIR/msmarco MS MARCO is a collection of datasets focused on deep learning in search Retrieval s2p 1 0 8848803 8841866 0 336.6 336.8
MSMARCOv2 BeIR/msmarco-v2 MS MARCO is a collection of datasets focused on deep learning in search Retrieval s2p 1 138641342 138368101 0 341.4 342.0 0
NFCorpus BeIR/nfcorpus NFCorpus: A Full-Text Learning to Rank Dataset for Medical Information Retrieval Retrieval s2p 1 0 0 3956 0 0 1462.7
NQ BeIR/nq Natural Questions: real user questions issued to Google Search, with answers drawn from Wikipedia Retrieval s2p 1 0 0 2684920 0 0 492.7
QuoraRetrieval BeIR/quora QuoraRetrieval is based on questions that are marked as duplicates on the Quora platform. Given a question, find other (duplicate) questions. Retrieval s2s 1 0 0 532931 0 0 62.9
SCIDOCS BeIR/scidocs SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction, to document classification and recommendation. Retrieval s2p 1 0 0 26657 0 0 1161.9
SciFact BeIR/scifact SciFact verifies scientific claims using evidence from the research literature containing scientific paper abstracts. Retrieval s2p 1 0 0 5483 0 0 1422.3
Touche2020 BeIR/webis-touche2020 Touché Task 1: Argument Retrieval for Controversial Questions Retrieval s2p 1 0 0 382594 0 0 1720.1
TRECCOVID BeIR/trec-covid TRECCOVID is an ad-hoc search challenge based on the CORD-19 dataset containing scientific articles related to the COVID-19 pandemic Retrieval s2p 1 0 0 171382 0 0 1117.4
BIOSSES mteb/biosses-sts Biomedical Semantic Similarity Estimation. STS s2s 1 0 0 200 0 0 156.6
SICK-R mteb/sickr-sts Semantic Textual Similarity SICK-R dataset STS s2s 1 0 0 19854 0 0 46.1
STS12 mteb/sts12-sts SemEval STS 2012 dataset. STS s2s 1 4468 0 6216 100.7 0 64.7
STS13 mteb/sts13-sts SemEval STS 2013 dataset. STS s2s 1 0 0 3000 0 0 54.0
STS14 mteb/sts14-sts SemEval STS 2014 dataset. Currently only the English dataset. STS s2s 1 0 0 7500 0 0 54.3
STS15 mteb/sts15-sts SemEval STS 2015 dataset STS s2s 1 0 0 6000 0 0 57.7
STS16 mteb/sts16-sts SemEval STS 2016 dataset STS s2s 1 0 0 2372 0 0 65.3
STS17 mteb/sts17-crosslingual-sts STS 2017 dataset STS s2s 11 0 0 500 0 0 43.3
STS22 mteb/sts22-crosslingual-sts SemEval 2022 Task 8: Multilingual News Article Similarity STS s2s 18 0 0 8060 0 0 1992.8
STSBenchmark mteb/stsbenchmark-sts Semantic Textual Similarity Benchmark (STSbenchmark) dataset. STS s2s 1 11498 3000 2758 57.6 64.0 53.6
SummEval mteb/summeval News article summary semantic similarity estimation. Summarization s2s 1 0 0 2800 0 0 359.8

Citation

If you find MTEB useful, feel free to cite our publication MTEB: Massive Text Embedding Benchmark:

@article{muennighoff2022mteb,
  doi = {10.48550/ARXIV.2210.07316},
  url = {https://arxiv.org/abs/2210.07316},
  author = {Muennighoff, Niklas and Tazi, Nouamane and Magne, Lo{\"\i}c and Reimers, Nils},
  title = {MTEB: Massive Text Embedding Benchmark},
  publisher = {arXiv},
  journal={arXiv preprint arXiv:2210.07316},  
  year = {2022}
}