RAG Evaluator is a Python library for evaluating Retrieval-Augmented Generation (RAG) systems. It provides various metrics to evaluate the quality of generated text against reference text.
You can install the library using pip:
pip install rag-evaluator
Here's how to use the RAG Evaluator library:
from rag_evaluator import RAGEvaluator
# Initialize the evaluator
evaluator = RAGEvaluator()
# Input data
question = "What are the causes of climate change?"
response = "Climate change is caused by human activities."
reference = "Human activities such as burning fossil fuels cause climate change."
# Evaluate the response
metrics = evaluator.evaluate_all(question, response, reference)
# Print the results
print(metrics)
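The evaluate_all call computes all of the metrics described below in one pass. As a minimal sketch (assuming the returned metrics object behaves like a dictionary keyed by metric name; the exact keys and return type may differ between versions), you can iterate over the results or pull out individual scores:

# Sketch only: assumes `metrics` is a dict-like mapping of metric names to scores.
for name, score in metrics.items():
    print(f"{name}: {score}")

# Pull out a single score; the key name here is an assumption.
bleu = metrics.get("BLEU")
if bleu is not None and bleu < 20:
    print("Low n-gram overlap with the reference answer.")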
To run the web app:
- cd into the Streamlit app folder.
- Create a virtual environment (for example, python -m venv venv).
- Activate the virtual environment (source venv/bin/activate on macOS/Linux, or venv\Scripts\activate on Windows).
- Install all dependencies (for example, pip install -r requirements.txt, if the folder includes a requirements file).
- Run the app:
streamlit run app.py
The RAG Evaluator provides the following metrics:
- BLEU (0-100): Measures the overlap between the generated output and reference text based on n-grams.
  - 0-20: Low similarity, 20-40: Medium-low, 40-60: Medium, 60-80: High, 80-100: Very high
- ROUGE-1 (0-1): Measures the overlap of unigrams between the generated output and reference text.
  - 0.0-0.2: Poor overlap, 0.2-0.4: Fair, 0.4-0.6: Good, 0.6-0.8: Very good, 0.8-1.0: Excellent
- BERT Score (0-1): Evaluates the semantic similarity using BERT embeddings (Precision, Recall, F1).
  - 0.0-0.5: Low similarity, 0.5-0.7: Moderate, 0.7-0.8: Good, 0.8-0.9: High, 0.9-1.0: Very high
- Perplexity (1 to ∞, lower is better): Measures how well a language model predicts the text.
  - 1-10: Excellent, 10-50: Good, 50-100: Moderate, 100+: High (potentially nonsensical)
- Diversity (0-1): Measures the uniqueness of bigrams in the generated output.
  - 0.0-0.2: Very low, 0.2-0.4: Low, 0.4-0.6: Moderate, 0.6-0.8: High, 0.8-1.0: Very high
- Racial Bias (0-1): Detects the presence of biased language in the generated output.
  - 0.0-0.2: Low probability, 0.2-0.4: Moderate, 0.4-0.6: High, 0.6-0.8: Very high, 0.8-1.0: Extreme
- MAUVE (0-1): Captures contextual meaning, coherence, and fluency, measuring both semantic similarity and stylistic alignment.
  - 0.0-0.2: Poor, 0.2-0.4: Fair, 0.4-0.6: Good, 0.6-0.8: Very good, 0.8-1.0: Excellent
- METEOR (0-1): Calculates semantic similarity considering synonyms and paraphrases.
  - 0.0-0.2: Poor, 0.2-0.4: Fair, 0.4-0.6: Good, 0.6-0.8: Very good, 0.8-1.0: Excellent
- CHRF (0-1): Computes the character n-gram F-score for fine-grained text similarity.
  - 0.0-0.2: Low, 0.2-0.4: Moderate, 0.4-0.6: Good, 0.6-0.8: High, 0.8-1.0: Very high
- Flesch Reading Ease (0-100): Assesses text readability.
  - 0-30: Very difficult, 30-50: Difficult, 50-60: Fairly difficult, 60-70: Standard, 70-80: Fairly easy, 80-90: Easy, 90-100: Very easy
- Flesch-Kincaid Grade (0-18+): Indicates the U.S. school grade level needed to understand the text.
  - 1-6: Elementary, 7-8: Middle school, 9-12: High school, 13+: College level
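These ranges are rough guidance rather than hard thresholds. Purely as an illustration (the helper below is hypothetical and not part of the library), a 0-1 score such as ROUGE-1, METEOR, or CHRF can be bucketed into the labels above like this:

# Hypothetical helper: maps a 0-1 score onto the qualitative bands listed above.
def quality_band(score: float) -> str:
    bands = [(0.2, "Poor"), (0.4, "Fair"), (0.6, "Good"), (0.8, "Very good")]
    for upper, label in bands:
        if score <= upper:
            return label
    return "Excellent"

print(quality_band(0.73))  # Very good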
To run the tests, use the following command:
python -m unittest discover -s rag_evaluator -p "test_*.py"
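If you contribute new metrics or fixes, a minimal test can exercise the same evaluate_all call shown above. The sketch below is hypothetical and not taken from the repository's test suite:

import unittest

from rag_evaluator import RAGEvaluator

class TestRAGEvaluator(unittest.TestCase):
    def test_evaluate_all_returns_metrics(self):
        evaluator = RAGEvaluator()
        metrics = evaluator.evaluate_all(
            "What are the causes of climate change?",
            "Climate change is caused by human activities.",
            "Human activities such as burning fossil fuels cause climate change.",
        )
        # Only assert that something was returned; the exact keys depend on the library version.
        self.assertIsNotNone(metrics)

if __name__ == "__main__":
    unittest.main()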
This project is licensed under the MIT License. See the LICENSE file for details.
Contributions are welcome! If you have any improvements, suggestions, or bug fixes, feel free to create a pull request (PR) or open an issue on GitHub. Please ensure your contributions adhere to the project's coding standards and include appropriate tests.
- Fork the repository.
- Create a new branch for your feature or bug fix.
- Make your changes.
- Run tests to ensure everything is working.
- Commit your changes and push to your fork.
- Create a pull request (PR) with a detailed description of your changes.
If you have any questions or need further assistance, feel free to reach out via email.