
AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark


☁️ Motivation

Evaluation is crucial for the development of information retrieval models. In recent years, a series of milestone works have been introduced to the community, such as MS MARCO, Natural Questions (open-domain QA), MIRACL (multilingual retrieval), and BEIR and MTEB (general-domain zero-shot retrieval). However, the existing benchmarks are severely limited in the following respects.

  • Inability to deal with new domains. All of the existing benchmarks are static: they are built for pre-defined domains based on human-labeled data. As a result, they cannot cover new domains that users are interested in.
  • Potential risk of over-fitting and data leakage. Existing retrievers are intensively fine-tuned to achieve strong performance on popular benchmarks such as BEIR and MTEB. Although these benchmarks were originally designed for zero-shot, out-of-domain evaluation, in-domain training data is widely used during fine-tuning. What is worse, given the public availability of the evaluation datasets, the testing data can easily be mixed into a retriever's training set by mistake.

☁️ Features

  • 🤖 Automated. The testing data is automatically generated by large language models without human intervention. This makes it possible to evaluate new domains instantly and at very low cost. Moreover, the newly generated testing data is highly unlikely to be covered by the training sets of any existing retrievers.
  • 🔍 Retrieval and RAG-oriented. The new benchmark is dedicated to the evaluation of retrieval performance. In addition to typical evaluation scenarios, such as open-domain question answering and paraphrase retrieval, the benchmark also incorporates a new setting called inner-document retrieval, which is closely related to today's LLM and RAG applications. In this setting, the model is expected to retrieve the relevant chunks of a very long document, i.e., the chunks that contain the critical information needed to answer the input question (a conceptual sketch follows this list).
  • 🔄 Heterogeneous and Dynamic. The testing data is generated for diverse and constantly growing sets of domains and languages (i.e., multi-domain and multi-lingual). As a result, it provides an increasingly comprehensive evaluation benchmark for community developers.
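
To make the inner-document setting concrete, here is a minimal sketch of chunk-level retrieval: a long document is split into chunks, and the chunks are ranked against a question with an off-the-shelf embedding model. It is purely illustrative and not part of AIR-Bench; the sentence-transformers model and the fixed-size chunking are arbitrary choices.

# Conceptual sketch of chunk-level (inner-document) retrieval; not AIR-Bench code.
# The embedding model and the fixed-size word chunking are arbitrary illustrative choices.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def split_into_chunks(document: str, chunk_size: int = 200) -> list[str]:
    # Naive fixed-size chunking by words; real pipelines usually chunk more carefully.
    words = document.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

def retrieve_chunks(question: str, document: str, top_k: int = 3) -> list[str]:
    chunks = split_into_chunks(document)
    q_emb = model.encode(question, convert_to_tensor=True)
    c_emb = model.encode(chunks, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, c_emb)[0]  # cosine similarity between the question and each chunk
    top = scores.topk(k=min(top_k, len(chunks)))
    return [chunks[i] for i in top.indices.tolist()]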

☁️ Results

We plan to release new test datasets on a regular basis. The latest version is 24.04. You can check out the results on the AIR-Bench Leaderboard.

Detailed results are available here.

☁️ Usage

Installation

This repo maintains the codebase for running AIR-Bench evaluations. To run the evaluations, please install air-benchmark.

pip install air-benchmark
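
To confirm the installation succeeded, you can query the installed distribution; "air-benchmark" is simply the PyPI package name used in the command above.

# Check that the air-benchmark package is installed and print its version.
from importlib.metadata import version
print(version("air-benchmark"))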

Evaluations

Follow the steps below to run evaluations and submit the results to the leaderboard (see here for more detailed information).

  1. Run evaluations

    • See the scripts to run evaluations on AIR-Bench for your models.
  2. Submit search results

    • Package the output files

      • For results produced without a reranking model:
      cd scripts
      python zip_results.py \
      --results_dir search_results \
      --retriever_name [YOUR_RETRIEVAL_MODEL] \
      --save_dir search_results
      • For results produced with a reranking model:
      cd scripts
      python zip_results.py \
      --results_dir search_results \
      --retriever_name [YOUR_RETRIEVAL_MODEL] \
      --reranker_name [YOUR_RERANKING_MODEL] \
      --save_dir search_results
    • Upload the output .zip and fill in the model information at the AIR-Bench Leaderboard (a quick way to inspect the archive before uploading is sketched below)
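
Before uploading, it can be useful to double-check the contents of the packaged archive. The snippet below is a generic sketch using the standard zipfile module; the .zip path is a placeholder, since the actual filename produced by zip_results.py depends on your model names.

# Inspect the packaged search results before uploading to the leaderboard.
# "search_results/your_model.zip" is a placeholder; use the file produced by zip_results.py.
import zipfile

with zipfile.ZipFile("search_results/your_model.zip") as archive:
    for member in archive.namelist():
        print(member)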

☁️ Documentation

  • 🏭 Pipeline: The data generation pipeline of AIR-Bench
  • 📋 Tasks: Overview of available tasks in AIR-Bench
  • 📈 Leaderboard: The interactive leaderboard of AIR-Bench
  • 🚀 Submit: Information on how to submit a model to AIR-Bench
  • 🤝 Contributing: How to contribute to AIR-Bench

☁️ Acknowledgement

This work is inspired by MTEB and BEIR. Many thanks to @tomaarsen, @Muennighoff, @takatost, and @chtlp for their early feedback.

☁️ Citing

TBD