
Benchmark baseline for retrieval QA applications


RQABench: Retrieval QA Benchmark


Retrieval QA Benchmark (RQABench for short) is an open-source, end-to-end test workbench for Retrieval-Augmented Generation (RAG) systems. We intend to build an open benchmark that helps developers and researchers reproduce existing RAG systems and design new ones. We also want to create a platform where everyone can share their building blocks and help others assemble their own retrieval + LLM systems.

Here are some of the major features of this benchmark:

  • Flexibility: We maximize flexibility in how you design your retrieval system: any transform works as long as it accepts a QARecord as input and returns a QARecord as output (see the sketch after this list).
  • Reproducibility: We gather all settings for an evaluation run into a single YAML configuration, which helps you track and reproduce experiments.
  • Traceability: We collect more than accuracy and scores. We also record the running time of any function you want to profile and the tokens used throughout the RAG system.
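
As a rough illustration of the flexibility point above, a transform is essentially a mapping from one QARecord to another. The sketch below is a toy example built on assumptions: it assumes QARecord is a pydantic model importable from retrieval_qa_benchmark.schema with a question field, and it ignores the real base classes and registration mechanism, which you can discover via print(str(REGISTRY)) in the example further down.

from retrieval_qa_benchmark.schema import QARecord  # assumed import path

def add_hint(record: QARecord) -> QARecord:
    """Toy transform: QARecord in, QARecord out."""
    # `model_copy` is pydantic v2's copy-with-update helper; the `question`
    # field name is an assumption for illustration.
    return record.model_copy(update={"question": "Answer concisely. " + record.question})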

Getting started

Clone and install

# Clone to your local machine
git clone https://github.com/myscale/Retrieval-QA-Benchmark
# Install it as an editable package
cd Retrieval-QA-Benchmark && python3 -m pip install -e .

Run it

from retrieval_qa_benchmark.models import *
from retrieval_qa_benchmark.datasets import *
from retrieval_qa_benchmark.transforms import *
from retrieval_qa_benchmark.evaluators import *
from retrieval_qa_benchmark.utils.profiler import PROFILER
# This loads our special YAML configuration with the `!include` keyword
from retrieval_qa_benchmark.utils.config import load
# This is where you construct an evaluator from the configuration
from retrieval_qa_benchmark.utils.factory import EvaluatorFactory

# This prints all modules registered by the wildcard imports above. You can also use it as a reference when editing your configuration
print(str(REGISTRY))

# Choose a configuration to evaluate
config = load(open("config/mmlu.yaml"))
evaluator = EvaluatorFactory.from_config(config).build()

# The evaluator returns the accuracy as a float and a list of `QAPrediction`
acc, result = evaluator()

# You can set `out_file` to generate a JSONL file, or write the results out yourself:
with open("some-file-name-to-store-result.jsonl", "w") as f:
    f.write("\n".join([r.model_dump_json() for r in result]))
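
Each line of the output file is one QAPrediction serialized as JSON, so the results can be loaded back with the standard json module for further analysis. The snippet below is a small convenience sketch; the fields available on each record come from the QAPrediction schema.

import json

# Load the saved predictions back; each line is one JSON-serialized `QAPrediction`.
with open("some-file-name-to-store-result.jsonl") as f:
    predictions = [json.loads(line) for line in f if line.strip()]
print(f"loaded {len(predictions)} predictions")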

Replicate our FAISS / MyScale Benchmark

  1. RAG with FAISS
  • Download the index file for wikipedia here (around 26G).
  • Download the dataset from Hugging Face with our code (around 140G). It will be downloaded automatically the first time you run the benchmark.
  • Set the index path in the configuration to the downloaded index (a quick way to sanity-check the index file is sketched after this list).
  2. RAG with MyScale
  • Download the Wikipedia data in Parquet format here.
  • Insert the data and create a vector index. You can also directly use our free pod hosting the Wikipedia data, as described here.
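
If you take the FAISS route, the downloaded index file can be sanity-checked by opening it with the FAISS Python API before wiring it into the configuration. The path below is a placeholder for wherever you saved the file.

import faiss

# Placeholder path -- replace with the location of the downloaded index file.
index = faiss.read_index("path/to/wikipedia.index")
print("vectors in index:", index.ntotal)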

Results with a simple RAG pipeline

with MyScale

| LLM | Contexts | mmlu-astronomy | mmlu-prehistory | mmlu-global-facts | mmlu-college-medicine | mmlu-clinical-knowledge | Average |
|---|---|---|---|---|---|---|---|
| gpt-3.5-turbo | (no context) | 71.71% | 70.37% | 38.00% | 67.63% | 74.72% | 68.05% |
| | (Top-1) | 75.66% (+3.95%) | 78.40% (+8.03%) | 46.00% (+8.00%) | 67.05% (-0.58%) | 73.21% (-1.51%) | 71.50% (+3.45%) |
| | (Top-3) | 76.97% (+5.26%) | 81.79% (+11.42%) | 48.00% (+10.00%) | 65.90% (-1.73%) | 73.96% (-0.76%) | 72.98% (+4.93%) |
| | (Top-5) | 78.29% (+6.58%) | 79.63% (+9.26%) | 42.00% (+4.00%) | 68.21% (+0.58%) | 74.34% (-0.38%) | 72.39% (+4.34%) |
| | (Top-10) | 78.29% (+6.58%) | 79.32% (+8.95%) | 44.00% (+6.00%) | 71.10% (+3.47%) | 75.47% (+0.75%) | 73.27% (+5.22%) |
| llama2-13b-chat-q6_0 | (no context) | 53.29% | 57.41% | 33.00% | 44.51% | 50.19% | 50.30% |
| | (Top-1) | 58.55% (+5.26%) | 61.73% (+4.32%) | 45.00% (+12.00%) | 46.24% (+1.73%) | 54.72% (+4.53%) | 55.13% (+4.83%) |
| | (Top-3) | 63.16% (+9.87%) | 63.27% (+5.86%) | 49.00% (+16.00%) | 46.82% (+2.31%) | 55.85% (+5.66%) | 57.10% (+6.80%) |
| | (Top-5) | 63.82% (+10.53%) | 65.43% (+8.02%) | 51.00% (+18.00%) | 51.45% (+6.94%) | 57.74% (+7.55%) | 59.37% (+9.07%) |
| | (Top-10) | 65.13% (+11.84%) | 66.67% (+9.26%) | 46.00% (+13.00%) | 49.71% (+5.20%) | 57.36% (+7.17%) | 59.07% (+8.77%) |
* The benchmark uses MyScale MSTG as the vector index.
* This benchmark can be reproduced with our GitHub repository retrieval-qa-benchmark.

with FAISS

| LLM | Contexts | mmlu-astronomy | mmlu-prehistory | mmlu-global-facts | mmlu-college-medicine | mmlu-clinical-knowledge | Average |
|---|---|---|---|---|---|---|---|
| gpt-3.5-turbo | (no context) | 71.71% | 70.37% | 38.00% | 67.63% | 74.72% | 68.05% |
| | (Top-1) | 75.00% (+3.29%) | 77.16% (+6.79%) | 44.00% (+6.00%) | 66.47% (-1.16%) | 73.58% (-1.14%) | 70.81% (+2.76%) |
| | (Top-3) | 75.66% (+3.95%) | 80.25% (+9.88%) | 44.00% (+6.00%) | 65.90% (-1.73%) | 73.21% (-1.51%) | 71.70% (+3.65%) |
| | (Top-5) | 78.29% (+6.58%) | 79.32% (+8.95%) | 46.00% (+8.00%) | 65.90% (-1.73%) | 73.58% (-1.14%) | 72.09% (+4.04%) |
| | (Top-10) | 78.29% (+6.58%) | 80.86% (+10.49%) | 49.00% (+11.00%) | 69.94% (+2.31%) | 75.85% (+1.13%) | 74.16% (+6.11%) |
| llama2-13b-chat-q6_0 | (no context) | 53.29% | 57.41% | 33.00% | 44.51% | 50.19% | 50.30% |
| | (Top-1) | 57.89% (+4.60%) | 61.42% (+4.01%) | 48.00% (+15.00%) | 45.66% (+1.15%) | 55.09% (+4.90%) | 55.22% (+4.92%) |
| | (Top-3) | 59.21% (+5.92%) | 65.74% (+8.33%) | 50.00% (+17.00%) | 50.29% (+5.78%) | 56.98% (+6.79%) | 58.28% (+7.98%) |
| | (Top-5) | 65.79% (+12.50%) | 64.51% (+7.10%) | 48.00% (+15.00%) | 50.29% (+5.78%) | 58.11% (+7.92%) | 58.97% (+8.67%) |
| | (Top-10) | 65.13% (+11.84%) | 66.05% (+8.64%) | 48.00% (+15.00%) | 47.40% (+2.89%) | 56.23% (+6.04%) | 58.38% (+8.08%) |
* The benchmark uses FAISS IVFSQ (nprobe=128) as the vector index; a short sketch of this index type follows below.
* This benchmark can be reproduced with our GitHub repository retrieval-qa-benchmark.
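
For context on the IVFSQ footnote, the sketch below shows how an IVF index with scalar quantization can be built and queried through the FAISS Python API with nprobe set to 128. The dimensionality, list count and random data are purely illustrative and are not the benchmark's actual settings.

import faiss
import numpy as np

d = 768                                           # embedding dimensionality (illustrative)
xb = np.random.rand(10000, d).astype("float32")   # toy corpus vectors
xq = np.random.rand(5, d).astype("float32")       # toy query vectors

# IVF with 1024 inverted lists and 8-bit scalar quantization (illustrative sizes)
index = faiss.index_factory(d, "IVF1024,SQ8")
index.train(xb)
index.add(xb)

index.nprobe = 128                                # the nprobe value quoted in the footnote
distances, ids = index.search(xq, 10)             # top-10 neighbours per query
print(ids.shape)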