PRompt Optimization in Multi-Step Tasks (PROMST): Integrating Human Feedback and Preference Alignment (EMNLP 2024, Oral, Top 3.4%)
Paper Link: https://arxiv.org/pdf/2402.08702.pdf
This project performs automatic prompt optimization, with a focus on multi-step tasks.
Set up the environment with conda and pip as follows:
```
conda create -n PROMST python=3.10
conda activate PROMST
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
conda install conda-forge::tiktoken
pip install openai --upgrade
pip install pygame
conda install conda-forge::transformers
conda install anaconda::scikit-learn
pip install gym==0.26.2
pip install pddlgym
pip install tarski
```
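Optionally, you can verify the installation with a quick import check (this snippet is just a convenience, not part of the repo):

```python
# Optional sanity check: import the packages installed above.
import gym, openai, pddlgym, pygame, sklearn, tarski, tiktoken, torch, transformers

print("gym", gym.__version__)      # expect 0.26.2
print("torch", torch.__version__)
print("openai", openai.__version__)
```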
Note that for the Logistics and BlocksWorld tasks we slightly modified the pddlgym files. You therefore need to replace the pddlgym directory installed in your site-packages (e.g., /Users/Your user name/opt/anaconda3/envs/PROMST/lib/python3.10/site-packages/pddlgym) with the pddlgym directory in BlocksWorld/env_data_BlocksWorld/pddlgym, for example:
```
cd /Users/Your user name/opt/anaconda3/envs/PROMST/lib/python3.10/site-packages
cp -r /Your path to PROMST/PROMST/BlocksWorld/env_data_BlocksWorld/pddlgym ./
```
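If you are unsure where pip placed pddlgym, the following one-liner (a convenience, not part of the repo) prints the directory to overwrite:

```python
# Print the installed pddlgym directory that should be replaced.
import os, pddlgym
print(os.path.dirname(pddlgym.__file__))
```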
No GPU is needed if you set with_score_model to 'False', since the score model is then not trained. To run each task, enter the corresponding task directory and set your OpenAI API key (sk-..) in LLM.py, line 11; a sketch of that line is shown below.
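This is only a minimal sketch of what the key assignment in LLM.py may look like; the exact variable and style in the file is an assumption, so adapt as needed:

```python
# Hypothetical sketch of the API-key setup around line 11 of LLM.py.
# Replace the placeholder with your own key; do not commit real keys.
import openai

openai.api_key = "sk-..."  # module-level key (works in both legacy and 1.x openai)
```

Once the key is set, run the following command; the arguments can be changed to your preference.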
```
python env3-box-arrange-train_MCTS.py \
    -experiment_trial_num 1 \
    -input_error_prompt_token_limit 15000 \
    -model_name_promptLLM gpt-4-1106-preview \
    -model_name_testLLM gpt-3.5-turbo-16k-0613 \
    -min_level 2 \
    -n_children 8 \
    -n_selected 2 \
    -prompt_method PROMST \
    -with_score_model 'False' \
    -Training_path ../BoxLift/train_set/ \
    -Testing_path ../BoxLift/test_set/ \
    -base_path ../BoxLift/
```
Another example run, which additionally sets -with_SumLLM 'False':

```
python env3-box-arrange-train_MCTS.py \
    -experiment_trial_num 3 \
    -input_error_prompt_token_limit 15000 \
    -model_name_promptLLM gpt-4-1106-preview \
    -model_name_testLLM gpt-3.5-turbo-16k-0613 \
    -min_level 2 \
    -n_children 4 \
    -n_selected 2 \
    -prompt_method PROMST \
    -with_score_model 'False' \
    -with_SumLLM 'False' \
    -Training_path ../BoxLift/train_set/ \
    -Testing_path ../BoxLift/test_set/ \
    -base_path ../BoxLift/
```
Command for ALFWorld/ScienceWorld/WebArena:
```
python agentboard/env9-box-arrange-train_MCTS.py \
    --cfg-path eval_configs/main_results_all_tasks.yaml \
    --tasks alfworld \
    --log_path ./results/gpt-3.5-turbo-16k-0613 \
    --project_name evaluate-gpt-4 \
    --experiment_trial_num 1 \
    --model_name_promptLLM gpt-4-1106-preview \
    --model_name_testLLM gpt-3.5-turbo-16k-0613 \
    --min_level 2 \
    --n_children 4 \
    --n_selected 2 \
    --prompt_method PROMST \
    --with_score_model 'False' \
    --base_path ./alfworld_result/ \
    --max_num_steps 30
```
Command to restart the training process from a previous run for ALFWorld/ScienceWorld/WebArena:
```
python agentboard/env9-box-arrange-train_MCTS_restart.py \
    --cfg-path eval_configs/main_results_all_tasks.yaml \
    --tasks webarena \
    --log_path ./results/gpt-3.5-turbo-16k-0613 \
    --project_name evaluate-gpt-4 \
    --experiment_trial_num 3 \
    --model_name_promptLLM gpt-4-1106-preview \
    --model_name_testLLM gpt-3.5-turbo-16k-0613 \
    --min_level 2 \
    --n_children 4 \
    --n_selected 2 \
    --prompt_method PROMST \
    --with_score_model 'False' \
    --base_path ./webarena_result/ \
    --max_num_steps 30
```
Argument | Choices | Explanation
---|---|---
experiment_trial_num | 1, 2, 3, ... | Trial number used to record each run
input_error_prompt_token_limit | 10000, 15000, 20000, ... | Maximum token length of the error prompt under the sliding context window
model_name_promptLLM | gpt-3.5-turbo-16k-0613, gpt-3.5-turbo-0301, gpt-4-1106-preview | LLM used as the PromptLLM
model_name_testLLM | gpt-3.5-turbo-16k-0613, gpt-3.5-turbo-0301, gpt-4-1106-preview | LLM used as the TestLLM
min_level | 1, 2, 3, ... | Minimum number of prompt levels to explore
n_children | 4, 8, 12, ... | Number of prompts expanded at each level
n_selected | 2, 3, 4, ... | Number of best prompts selected for further optimization at each level
prompt_method | PROMST, APE, APO | Prompt optimization method
with_score_model | 'False', 'True' | Whether to use the score model in PROMST
Training_path | ../BoxLift/train_set/ | Path to the created train_set directory
Testing_path | ../BoxLift/train_set/, ../BoxLift/test_set/ | Path to the created test_set directory (set to the same value as Training_path during optimization)
base_path | ../BoxLift/ | Base path where the prompt optimization results are saved
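To illustrate how min_level, n_children, and n_selected interact, here is a simplified sketch of a level-wise prompt search loop; expand_prompt and evaluate_prompt are hypothetical stand-ins for the PromptLLM rewrite step and the TestLLM evaluation, not the repo's actual functions:

```python
import random

# Hypothetical stand-ins: in PROMST, prompt expansion is done by the PromptLLM
# (guided by error feedback) and evaluation by running the TestLLM on the train set.
def expand_prompt(prompt: str) -> str:
    return prompt + " (revised)"

def evaluate_prompt(prompt: str) -> float:
    return random.random()

def prompt_search(seed_prompt: str, min_level: int = 2,
                  n_children: int = 8, n_selected: int = 2) -> str:
    """Simplified level-wise search over prompts.

    At each level, n_children candidate prompts are expanded from the current
    best prompts, scored, and the n_selected best survive to the next level.
    """
    frontier = [seed_prompt]
    for _ in range(min_level):
        candidates = [expand_prompt(random.choice(frontier)) for _ in range(n_children)]
        candidates.sort(key=evaluate_prompt, reverse=True)
        frontier = candidates[:n_selected]
    return frontier[0]

print(prompt_search("You are a task planner."))
```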
If you find this work useful, please cite:

```
@article{chen2024prompt,
  title={Prompt optimization in multi-step tasks ({PROMST}): Integrating human feedback and preference alignment},
  author={Chen, Yongchao and Arkin, Jacob and Hao, Yilun and Zhang, Yang and Roy, Nicholas and Fan, Chuchu},
  journal={arXiv preprint arXiv:2402.08702},
  year={2024}
}
```
Related work:
- Large Language Models Are Human-Level Prompt Engineers (APE)
- Automatic Prompt Optimization with "Gradient Descent" and Beam Search (APO)
- Scalable Multi-Robot Collaboration with Large Language Models: Centralized or Decentralized Systems?