FMEval is a library for evaluating Large Language Models (LLMs) and selecting the best LLM for your use case. The library can help evaluate LLMs for the following tasks:
- Open-ended generation - the production of natural language as a response to general prompts that do not have a pre-defined structure.
- Text summarization - summarizing the most important parts of a text, shortening a text while preserving its meaning.
- Question Answering - the generation of a relevant and accurate response to a question.
- Classification - assigning a category, such as a label or score, to text based on its content.
The library contains the following:
- Implementation of popular metrics (eval algorithms) such as Accuracy, Toxicity, Semantic Robustness and Prompt Stereotyping for evaluating LLMs across different tasks.
- Implementation of the ModelRunner interface. ModelRunner encapsulates the logic for invoking LLMs, exposing a predict method that greatly simplifies interactions with LLMs within eval algorithm code. The interface can be extended by the user for their own LLMs. There is built-in support for Amazon SageMaker JumpStart endpoints, Amazon SageMaker endpoints, and Amazon Bedrock models.
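For instance, a minimal sketch of a custom ModelRunner might look like the following. The `client` object and its `generate` method are hypothetical placeholders for however you invoke your own model; treat this as a sketch rather than the library's prescribed implementation.

```python
from typing import Optional, Tuple

from fmeval.model_runners.model_runner import ModelRunner


class MyModelRunner(ModelRunner):
    """Minimal sketch of a custom ModelRunner for an LLM hosted anywhere."""

    def __init__(self, client):
        # `client` is a hypothetical handle to your model, e.g. an SDK or HTTP client
        # exposing a generate(prompt) -> str method. Base-class constructor arguments
        # are skipped here for brevity; check the ModelRunner docs for your version.
        self._client = client

    def predict(self, prompt: str) -> Tuple[Optional[str], Optional[float]]:
        # Return the generated text plus an optional log probability
        # (None is acceptable if your model does not expose one).
        return self._client.generate(prompt), None
```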
To install the package from PyPI, simply run:

```
pip install fmeval
```
You can see examples of running evaluations on your LLMs with built-in or custom datasets in the examples folder.
The main steps for using fmeval are:
- Create a ModelRunner which can perform invocations on your LLM. There is built-in support for Amazon SageMaker JumpStart endpoints, Amazon SageMaker endpoints, and Amazon Bedrock models; you can also extend the ModelRunner interface for LLMs hosted anywhere (see the sketch after these steps).
- Use any of the supported eval_algorithms.

```python
# Import paths shown here may differ slightly across fmeval versions.
from fmeval.eval import get_eval_algorithm
from fmeval.eval_algorithms.toxicity import ToxicityConfig

eval_algo = get_eval_algorithm("toxicity", ToxicityConfig())
eval_output = eval_algo.evaluate(model=model_runner)
```
Note: You can update the default eval config parameters for your specific use case.
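As a concrete example of the first step, the sketch below creates a ModelRunner for a model hosted on Amazon Bedrock. The model ID, content template, and output JMESPath shown are assumptions for one particular model; adapt them to the model you invoke.

```python
from fmeval.model_runners.bedrock_model_runner import BedrockModelRunner

# Sketch: invoke a Claude model on Amazon Bedrock. The content_template tells the
# runner how to build the request payload ($prompt is substituted per record), and
# `output` is the JMESPath used to extract the generated text from the response.
model_runner = BedrockModelRunner(
    model_id="anthropic.claude-v2",
    content_template='{"prompt": $prompt, "max_tokens_to_sample": 500}',
    output="completion",
)
```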
The library ships with built-in datasets that are used by default to compute scores in the eval algorithms. You can also use a custom dataset, as follows.
- Create a DataConfig for your custom dataset
```python
from fmeval.data_loaders.data_config import DataConfig

config = DataConfig(
    dataset_name="custom_dataset",
    dataset_uri="./custom_dataset.jsonl",
    dataset_mime_type="application/jsonlines",
    model_input_location="question",
    target_output_location="answer",
)
```
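In this configuration, each line of custom_dataset.jsonl is a JSON object whose "question" and "answer" fields correspond to model_input_location and target_output_location. For example (contents are purely illustrative):

```jsonl
{"question": "What is the capital of France?", "answer": "Paris"}
{"question": "Who wrote Hamlet?", "answer": "William Shakespeare"}
```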
- Use an eval algorithm with a custom dataset
```python
eval_algo = get_eval_algorithm("toxicity", ToxicityConfig())
eval_output = eval_algo.evaluate(model=model_runner, dataset_config=config)
```
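The returned eval_output holds the computed scores for the dataset. A minimal sketch of inspecting it, assuming evaluate returns a list of per-dataset result objects:

```python
# Print the aggregated results for each evaluated dataset.
for result in eval_output:
    print(result)
```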
Please refer to the code documentation and examples for further details on using the eval algorithms.
Once a virtual environment is set up with Python 3.10, run the following command to install all dependencies:

```
./devtool all
```
We use Poetry to manage Python dependencies in this project. To add a new dependency, update the pyproject.toml file and run the poetry update command to refresh the poetry.lock file (which is checked in).
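For example, assuming the new dependency has already been added under Poetry's standard [tool.poetry.dependencies] section of pyproject.toml:

```
poetry update
```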
Aside from the dependency step above, everything else should be managed with devtool commands.
Details TBA
See CONTRIBUTING for more information.
This project is licensed under the Apache-2.0 License.