promptfoo
helps you tune LLM prompts systematically across many relevant test cases.
With promptfoo, you can:
- Test multiple prompts against predefined test cases
- Evaluate quality and catch regressions by comparing LLM outputs side-by-side
- Speed up evaluations with caching and concurrent tests
- Flag bad outputs automatically by setting "expectations"
- Use as a command line tool, or integrate into your workflow as a library
- Use OpenAI models, open-source models like Llama and Vicuna, or integrate custom API providers for any LLM API
The goal: test-driven prompt engineering, rather than trial-and-error.
promptfoo produces matrix views that let you quickly evaluate outputs across many prompts.
Here's an example of a side-by-side comparison of multiple prompts and inputs:
It works on the command line too:
Start by establishing a handful of test cases - core use cases and failure cases that you want to ensure your prompt can handle.
As you explore modifications to the prompt, use promptfoo eval
to rate all outputs. This ensures the prompt is actually improving overall.
As you collect more examples and establish a user feedback loop, continue to build the pool of test cases.
To get started, run this command:
npx promptfoo init
This will create some placeholders in your current directory: prompts.txt
and promptfooconfig.yaml
.
After editing the prompts and variables to your liking, run the eval command to kick off an evaluation:
npx promptfoo eval
The YAML configuration format runs each prompt through a series of example inputs (aka "test case") and checks if they meet requirements (aka "assert").
See the Configuration docs for a detailed guide.
prompts: [prompts.txt]
providers: [openai:gpt-3.5-turbo]
tests:
- description: First test case - automatic review
vars:
var1: first variable's value
var2: another value
var3: some other value
assert:
- type: equality
value: expected LLM output goes here
- type: function
value: output.includes('some text')
- description: Second test case - manual review
# Test cases don't need assertions if you prefer to review the output yourself
vars:
var1: new value
var2: another value
var3: third value
- description: Third test case - other types of automatic review
vars:
var1: yet another value
var2: and another
var3: dear llm, please output your response in json format
assert:
- type: contains-json
- type: similarity
value: ensures that output is semantically similar to this text
- type: llm-rubric
value: ensure that output contains a reference to X
Some people prefer to configure their LLM tests in a CSV. In that case, the config is pretty simple:
prompts: [prompts.txt]
providers: [openai:gpt-3.5-turbo]
tests: tests.csv
See example CSV.
If you're looking to customize your usage, you have a wide set of parameters at your disposal.
Option | Description |
---|---|
-p, --prompts <paths...> |
Paths to prompt files, directory, or glob |
-r, --providers <name or path...> |
One of: openai:chat, openai:completion, openai:model-name, localai:chat:model-name, localai:completion:model-name. See API providers |
-o, --output <path> |
Path to output file (csv, json, yaml, html) |
--tests <path> |
Path to external test file |
-c, --config <path> |
Path to configuration file. promptfooconfig.js/json/yaml is automatically loaded if present |
-j, --max-concurrency <number> |
Maximum number of concurrent API calls |
--table-cell-max-length <number> |
Truncate console table cells to this length |
--prompt-prefix <path> |
This prefix is prepended to every prompt |
--prompt-suffix <path> |
This suffix is append to every prompt |
--grader |
Provider that will conduct the evaluation, if you are using LLM to grade your output |
After running an eval, you may optionally use the view
command to open the web viewer:
npx promptfoo view
In this example, we evaluate whether adding adjectives to the personality of an assistant bot affects the responses:
npx promptfoo eval -p prompts.txt -r openai:gpt-3.5-turbo -t tests.csv
This command will evaluate the prompts in prompts.txt
, substituing the variable values from vars.csv
, and output results in your terminal.
You can also output a nice spreadsheet, JSON, YAML, or an HTML file:
In the next example, we evaluate the difference between GPT 3 and GPT 4 outputs for a given prompt:
npx promptfoo eval -p prompts.txt -r openai:gpt-3.5-turbo openai:gpt-4 -o output.html
Produces this HTML table:
You can also use promptfoo
as a library in your project by importing the evaluate
function. The function takes the following parameters:
-
testSuite
: the Javascript equivalent of the promptfooconfig.yamlinterface TestSuiteConfig { providers: string[]; // Valid provider name (e.g. openai:gpt-3.5-turbo) prompts: string[]; // List of prompts tests: string | TestCase[]; // Path to a CSV file, or list of test cases defaultTest?: Omit<TestCase, 'description'>; // Optional: add default vars and assertions on test case outputPath?: string; // Optional: write results to file } interface TestCase { description?: string; vars?: Record<string, string>; assert?: Assertion[]; prompt?: PromptConfig; grading?: GradingConfig; } interface Assertion { type: 'equality' | 'is-json' | 'contains-json' | 'function' | 'similarity' | 'llm-rubric'; value?: string; threshold?: number; // For similarity assertions provider?: ApiProvider; // For assertions that require an LLM provider }
-
options
: misc options related to how the tests are runinterface EvaluateOptions { maxConcurrency?: number; showProgressBar?: boolean; generateSuggestions?: boolean; }
promptfoo
exports an evaluate
function that you can use to run prompt evaluations.
import promptfoo from 'promptfoo';
const results = await promptfoo.evaluate({
prompts: ['Rephrase this in French: {{body}}', 'Rephrase this like a pirate: {{body}}'],
providers: ['openai:gpt-3.5-turbo'],
tests: [
{
vars: {
body: 'Hello world',
},
},
{
vars: {
body: "I'm hungry",
},
},
],
});
This code imports the promptfoo
library, defines the evaluation options, and then calls the evaluate
function with these options.
See the full example here, which includes an example results object.
- Main guide: Learn about how to configure your YAML file, setup prompt files, etc.
- Configuring test cases: Learn more about how to configure expected outputs and test assertions.
We support OpenAI's API as well as a number of open-source models. It's also to set up your own custom API provider. See Provider documentation for more details.
Contributions are welcome! Please feel free to submit a pull request or open an issue.
promptfoo
includes several npm scripts to make development easier and more efficient. To use these scripts, run npm run <script_name>
in the project directory.
Here are some of the available scripts:
build
: Transpile TypeScript files to JavaScriptbuild:watch
: Continuously watch and transpile TypeScript files on changestest
: Run test suitetest:watch
: Continuously run test suite on changes