RAG Evaluator is a Python library for evaluating Retrieval-Augmented Generation (RAG) systems. It provides a set of metrics for assessing the quality of generated text against a reference text.
You can install the library using pip:
```bash
pip install rag-evaluator
```
Here's how to use the RAG Evaluator library:
```python
from rag_evaluator import RAGEvaluator

# Initialize the evaluator
evaluator = RAGEvaluator()

# Input data
question = "What are the causes of climate change?"
response = "Climate change is caused by human activities."
reference = "Human activities such as burning fossil fuels cause climate change."

# Evaluate the response
metrics = evaluator.evaluate_all(question, response, reference)

# Print the results
print(metrics)
```
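The structure of the returned object is not shown above; assuming `evaluate_all` returns a dictionary mapping metric names to scores (an assumption worth checking against the library's documentation for your installed version), you could print each score individually:

```python
# Assumes `metrics` behaves like a dict of metric name -> score;
# verify against the installed version of rag-evaluator.
for name, score in metrics.items():
    print(f"{name}: {score}")
```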
To run the Streamlit web app:

- `cd` into the Streamlit app folder.
- Create a virtual environment.
- Activate the virtual environment.
- Install the dependencies.
- Start the app:

```bash
streamlit run app.py
```
The following metrics are provided by the library:
- BLEU: Measures the overlap between the generated output and reference text based on n-grams.
- ROUGE-1: Measures the overlap of unigrams between the generated output and reference text.
- BERT Score: Evaluates the semantic similarity between the generated output and reference text using BERT embeddings.
- Perplexity: Measures how well a language model predicts the text.
- Diversity: Measures the uniqueness of bigrams in the generated output.
- Racial Bias: Detects the presence of biased language in the generated output.
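For intuition about what the overlap-based metrics capture, here is a small standalone sketch (independent of this library's implementation) that computes a ROUGE-1-style unigram recall and a bigram-diversity ratio in plain Python:

```python
# Standalone illustration of unigram overlap (ROUGE-1-style recall)
# and bigram diversity; not the library's own implementation.
from collections import Counter


def _tokens(text: str) -> list[str]:
    """Lowercase and strip trailing punctuation for a rough tokenization."""
    return [t.strip(".,!?") for t in text.lower().split()]


def rouge1_recall(response: str, reference: str) -> float:
    """Fraction of reference unigrams that also appear in the response."""
    resp_counts = Counter(_tokens(response))
    ref_counts = Counter(_tokens(reference))
    overlap = sum(min(resp_counts[w], c) for w, c in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0


def bigram_diversity(text: str) -> float:
    """Ratio of unique bigrams to total bigrams in the text."""
    tokens = _tokens(text)
    bigrams = list(zip(tokens, tokens[1:]))
    return len(set(bigrams)) / len(bigrams) if bigrams else 0.0


response = "Climate change is caused by human activities."
reference = "Human activities such as burning fossil fuels cause climate change."
print(rouge1_recall(response, reference))  # share of reference words covered
print(bigram_diversity(response))          # 1.0 when every bigram is unique
```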
To run the tests, use the following command:
```bash
python -m unittest discover -s rag_evaluator -p "test_*.py"
```
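As a sketch of what such a test might look like (assuming the `evaluate_all` API shown in the usage example above; the file and class names here are hypothetical):

```python
# test_evaluator_smoke.py -- hypothetical smoke test, assuming the
# RAGEvaluator.evaluate_all API shown in the usage example above.
import unittest

from rag_evaluator import RAGEvaluator


class TestRAGEvaluatorSmoke(unittest.TestCase):
    def test_evaluate_all_returns_metrics(self):
        evaluator = RAGEvaluator()
        metrics = evaluator.evaluate_all(
            "What are the causes of climate change?",
            "Climate change is caused by human activities.",
            "Human activities such as burning fossil fuels cause climate change.",
        )
        # Only assert that something was returned; the exact keys and
        # score ranges depend on the library version.
        self.assertIsNotNone(metrics)


if __name__ == "__main__":
    unittest.main()
```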