This is the official implementation of the paper *Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers*.
Large Language Models (LLMs) excel in various tasks, but they rely on carefully crafted prompts that often demand substantial human effort. To automate this process, in this paper, we propose a novel framework for discrete prompt optimization, called EvoPrompt, which borrows the idea of evolutionary algorithms (EAs) as they exhibit good performance and fast convergence. To enable EAs to work on discrete prompts, which are natural language expressions that need to be coherent and human-readable, we connect LLMs with EAs. This approach allows us to simultaneously leverage the powerful language processing capabilities of LLMs and the efficient optimization performance of EAs. Specifically, abstaining from any gradients or parameters, EvoPrompt starts from a population of prompts and iteratively generates new prompts with LLMs based on the evolutionary operators, improving the population based on the development set. We optimize prompts for both closed- and open-source LLMs including GPT-3.5 and Alpaca, on 31 datasets covering language understanding, generation tasks, as well as BIG-Bench Hard (BBH) tasks. EvoPrompt significantly outperforms human-engineered prompts and existing methods for automatic prompt generation (e.g., up to 25% on BBH). Furthermore, EvoPrompt demonstrates that connecting LLMs with EAs creates synergies, which could inspire further research on the combination of LLMs and conventional algorithms.
- Environment setup: `pip install -r requirements.txt`
- Data download: the test data for the language understanding tasks can be found here. Put the test file in the folder `./data/cls/{dataset_name}`. For BBH datasets, download them from the CoT-hub repo and put them in the folder `BBH/data/{dataset_name}`.
- OpenAI API key required: add your OpenAI API key and other related settings in the file `auth.yaml`.
We instantiate two evolutionary algorithms, GA (genetic algorithm) and DE (differential evolution), to evolve the initial population. Evolve your prompts with the following commands, customizing the parameter `--llm_type` to use `text-davinci-003`, `gpt-3.5-turbo`, or `gpt-4`:
```bash
# understanding task on Alpaca
bash scripts/cls/run_ga_alpaca.sh   # Genetic algorithm
bash scripts/cls/run_de_alpaca.sh   # Differential evolution

# simplification task on Alpaca
bash scripts/sim/run_de_alpaca.sh
bash scripts/sim/run_ga_alpaca.sh

# summarization task on Alpaca
bash scripts/sum/run_de_alpaca.sh
bash scripts/sum/run_ga_alpaca.sh

# BBH tasks
cd BBH
bash scripts/run_de_cot.sh          # DE
bash scripts/run_ga_cot.sh          # GA
```
To evaluate a single instruction, run the following commands and set the argument `--content` to the specific prompt whose performance you want to evaluate:
```bash
bash scripts/cls/eval_single_alpaca.sh   # understanding task on Alpaca
bash scripts/sim/eval_single_alpaca.sh   # simplification
bash scripts/sum/eval_single_alpaca.sh   # summarization

# BBH
cd BBH
bash scripts/eval.sh                     # few-shot evaluation
```
Note that our framework uses two language models: one for evolution (argument `--llm_type`) and the other for task implementation (argument `--language_model`).
The number of iterations and the population size affect the performance of EvoPrompt, and there is a trade-off between cost and performance. For relatively simple tasks, a population size of 10 and 10 iterations are enough, or even fewer; for complex tasks, a larger and more diverse population is required.
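As a rough, assumption-based guide to that trade-off, the sketch below estimates the number of LLM calls per run, assuming one evolution-LLM call per newly generated prompt and one task-model call per dev example when scoring a candidate (the actual counts depend on the algorithm, caching, and batching):

```python
# Back-of-envelope estimate of LLM calls per EvoPrompt run.
# Assumption-based illustration, not an exact accounting of the code's behavior.
def estimate_llm_calls(popsize: int, iterations: int, dev_size: int) -> dict:
    evolution_calls = popsize * iterations               # one evolution-LLM call per new prompt
    evaluation_calls = popsize * iterations * dev_size   # one task-model call per dev example per candidate
    return {"evolution": evolution_calls, "evaluation": evaluation_calls}

print(estimate_llm_calls(popsize=10, iterations=10, dev_size=200))
# {'evolution': 100, 'evaluation': 20000}
```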
You may need to set the following arguments to customize your own configuration.
- `task`: the task category, such as `sim` (simplification), `cls` (classification), or `sum` (summarization). If you need to extend this to other tasks, you may need to override the evaluation metric.
- `dataset`: the dataset on which to evolve prompts.
- `dev_file`: the path of the development set.
- `language_model`: the model used for task implementation.
- `llm_type`: the LLM used to evolve prompts.
- `position`: mainly indicates whether to use demonstrations (zero-shot or few-shot).
- `sample_num`: the size of the dev set, mainly used for generation tasks where there is no need to set `dev_file`.
- `prompt_num`: the number of examples used as few-shot demonstrations.
The pipeline of EvoPrompt consists of three main steps, with slight differences in how each algorithm instantiates them (a minimal sketch of the overall loop follows the list):
- Initialization: we use manually written prompts or prompts generated by GPT as the initial population (see `prompts.txt` and `prompts_auto.txt` under the path of each dataset).
- Evolution (mutation and crossover): for the templates used by DE and GA, see `./data/template_ga.py` and `./data/template_de.py`. Each template includes a one-example demonstration of the algorithm so that the LLM follows the evolution steps and produces the expected prompt. To avoid the LLM copying the demonstration, the demonstration task is different from the task being optimized.
- Evaluation and update: after each iteration, we select which prompts to keep in the population. For GA, we keep the top-$N$ prompts in each iteration; for DE, an old prompt is replaced if the newly generated one is better.
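A minimal GA-style sketch of this loop (illustrative only, not the repository's actual API; `llm_evolve` and `score_on_dev` are hypothetical stand-ins for the LLM-driven evolution step and the dev-set evaluation):

```python
import random

def evoprompt_ga(init_prompts, llm_evolve, score_on_dev, popsize=10, iterations=10):
    """Minimal GA-style sketch of the EvoPrompt loop (illustrative only).

    llm_evolve(p1, p2) -> new prompt produced by the evolution LLM (crossover + mutation).
    score_on_dev(p)    -> scalar score of prompt p on the development set.
    """
    population = [(p, score_on_dev(p)) for p in init_prompts]
    for _ in range(iterations):
        children = []
        for _ in range(popsize):
            p1, p2 = random.sample([p for p, _ in population], 2)  # select two parent prompts
            child = llm_evolve(p1, p2)                             # ask the evolution LLM for a new prompt
            children.append((child, score_on_dev(child)))
        # keep the top-N prompts among parents and children
        population = sorted(population + children, key=lambda t: t[1], reverse=True)[:popsize]
    return max(population, key=lambda t: t[1])[0]
```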
- Selection strategy: in each iteration, we select parent prompts for mutation and crossover, which serve as donors for the child prompts. Set the argument `sel_mode` to apply different strategies; there are three choices, `["wheel", "random", "tour"]`, and we use `wheel` by default (see the selection sketch after this list).
- Update: after generating a population of the same size as the original one, $N$, we select the top-$N$ prompts as the new population.
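For reference, roulette-wheel (fitness-proportionate) selection can be sketched as follows; this is a generic illustration, not the code in `evoluter.py`:

```python
import random

def wheel_selection(prompts, scores, k=2):
    """Fitness-proportionate ("wheel") selection: higher-scoring prompts are
    sampled with higher probability (with replacement, for simplicity)."""
    total = sum(scores)
    if total == 0:  # degenerate case: fall back to uniform sampling
        return random.sample(prompts, k)
    weights = [s / total for s in scores]
    return random.choices(prompts, weights=weights, k=k)

# Example: select two parents from three candidate prompts
parents = wheel_selection(["prompt A", "prompt B", "prompt C"], [0.8, 0.5, 0.1], k=2)
```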
- Design in DE: we apply different DE templates for ablations. Specify the argument `template` to use different settings:
  - Eliminate Prompt 3: `--template v2`
  - Prompt 3 (random): add the argument `--donor_random`
  - Prompt 3 (best): `--template v1` (default setting)
  - Different part: `--template v3`
- Update: different from GA, in each iteration several donor prompts are used to generate a new prompt `p'` for each prompt `p`; if `p'` is better than `p`, then `p` is replaced by `p'`, otherwise `p` is kept (see the sketch after this list).
```
.
├── args.py
├── auth.yaml
├── BBH                    # code for BBH tasks
├── data                   # datasets and templates used
│   ├── cls
│   ├── sim
│   ├── sum
│   ├── template_de.py     # templates of prompt evolution by DE
│   ├── template_ga.py     # templates of prompt evolution by GA
│   ├── template_v2.json   # templates for task implementation
│   └── templates.py       # wrapper
├── dataset.py             # dataset class
├── evaluator.py           # evaluators on different tasks
├── evoluter.py            # DE, GA, APE
├── evolution.py           # DE, GA, APE
├── get_result.py
├── infer.py               # main file for inference
├── llm_client.py          # LLM query
├── metrics.py             # metric calculation
├── requirements.txt
├── run.py                 # main file for evolution
├── scripts                # scripts to run the code
└── utils.py               # auxiliary functions
```
- Aggregation: given a high-quality final population, ensembling strategies can be effectively applied to the prompts.
- More fine-grained metrics: to select which prompts to keep in the population, we evaluate their performance on the dev set. However, for understanding tasks, metrics such as accuracy or F1 are coarse-grained, and sometimes not discriminative enough to decide which prompts to keep, since several prompts may achieve the same score.
- More complex tasks are left to explore.
If you find this repository helpful, please consider citing our paper:
```bibtex
@article{guo2023connecting,
  title={Connecting Large Language Models with Evolutionary Algorithms Yields Powerful Prompt Optimizers},
  author={Guo, Qingyan and Wang, Rui and Guo, Junliang and Li, Bei and Song, Kaitao and Tan, Xu and Liu, Guoqing and Bian, Jiang and Yang, Yujiu},
  journal={arXiv preprint arXiv:2309.08532},
  year={2023}
}
```
Our codebase is based on the following repos. Thanks for open-sourcing!
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.