This project is designed to assess the quality of a dataset by evaluating each sample and determining if it should be
included in the cleaned version of the dataset. This project uses multiple Large Language Models (LLMs) as experts to
rate samples based on their content and provides a summary of the average rating along with a classification as bad
or good
.
The evaluation process uses a quorum-based approach, where each sample is rated by a pool of experts and a majority vote determines its inclusion in the cleaned dataset. If a majority vote has not been achieved, the sample will be excluded from the cleaned version.
- Uses multiple LLMs to rate samples based on their content
- Calculates the average rating for each sample
- Determines if the majority vote of quorum has been achieved for each sample
- Classifies the average rating as
bad
orgood
based on a threshold of 3.5
Each sample of dataset will be processed separately.
You may use any LLM as an expert for your quorum; the only limitation is that the remote API should be compatible with the OpenAI API client.
Example of experts.yml
configuration:
experts:
- model: gpt-3.5-turbo
- model: anthropic/claude-3-haiku
- model: perplexity/llama-3-sonar-small-32k-online
- model: google/palm-2-chat-bison-32k
- model: google/gemma-2-9b-it
Here you may set multiple models; they will work as experts in the quorum.
You may use different API keys, base URLs, and prompt templates:
experts:
- model: gpt-3.5-turbo
api_key: sk-XXXX
base_url: https://api.openai.com/v1
prompt_template: Evaluate how well this example conveys its meaning?\nPlease rate text below from 1 (poor) to 5 (excellent), RESPONSE ONLY ONE NUMBER:\n\n{{ context }}\n
- model: gpt-3.5-turbo
api_key: sk-YYYY
base_url: https://api.vsegpt.ru/v1
The template should at least include the {{ context }}
field.
Can you evaluate how well this example conveys its meaning, how well it is organized and structured, whether it fits the theme of the conversation, and whether its responses are accurate?
Please rate text below from 1 (bad) to 5 (good), RESPONSE ONLY ONE NUMBER:
{{ context }}
See prompt_template.txt for details.
- dqa.ipynb - standalone example with all classes and function used under the hood.
- dqa-simplified.ipynb - simplified example, it works in the same way
This project is licensed under the MIT License. See the LICENSE file for details.
If you use this project in your research or work, please cite it as follows:
[Pavel Rykov]. (2024). Quorum of LLMs for Dataset Quality Assessment. GitHub. https://github.com/EvilFreelancer/dqa-quorum
Alternatively, in BibTeX format:
@misc{pavelrykov2024dqaquorum,
author = {Pavel Rykov},
title = {Quorum of LLMs for Dataset Quality Assessment},
year = {2024},
url = {https://github.com/EvilFreelancer/dqa-quorum}
}