This is the official documentation for the paper Flatness-Aware Prompt Selection Improves Accuracy and Sample Efficiency.
To run the code, follow the steps below. First, install the required dependencies:
pip install -r requirements.txt
Next, get the metric scores for the prompts as follows:
CUDA_VISIBLE_DEVICES=0 python main.py \
--model="gpt2" \
--dataset=agnews \
--num_seeds=1 \
--all_shots=4 \
--subsample_test_set=512 \
--approx
all_shots: number of demonstrations
model: the selected model
dataset: dataset name
subsample_test_set: size of the test set to use to speed up evaluation; None means using the full test set
After running the command above, you'll get the results as a pickle file. For each experiment, we store a result tree in the following format:
{
    seed_id: {
        prompt_id: {
            // prompt-level info
            id: prompt_id,
            promt: prompt_text,
            sen: sen_score,
            mi: mi_score,
            perf: performance (acc),
        },
        // seed-level info: correlations across prompts
        sen_p: ...,
        sen_s: ...,
        mi_p: ...,
        mi_s: ...,
    },
    // top-level info (e.g., average sensitivity, average accuracy) is computed by the print_results function; it is not stored in the pickle
}
id: the prompt id
promt: the contents of the prompt
sen: the sensitivity of the prompt
mi: the mutual information of the prompt
perf: the accuracy of the prompt
sen_p: Pearson correlation between performance and sensitivity
sen_s: Spearman correlation between performance and sensitivity
mi_p: Pearson correlation between performance and mutual information
mi_s: Spearman correlation between performance and mutual information
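For example, a result file can be inspected with a short script like the one below (a minimal sketch; the file name results.pkl is an assumption, so substitute the path your run actually produces):

import pickle

# Load one experiment's result tree (file name is an assumption;
# use the path written by main.py for your run).
with open("results.pkl", "rb") as f:
    results = pickle.load(f)

for seed_id, seed_info in results.items():
    # Prompt-level entries are dicts keyed by prompt id; the
    # seed-level correlation entries are plain scores.
    for key, value in seed_info.items():
        if isinstance(value, dict):
            print(f"seed {seed_id} prompt {value['id']}: "
                  f"sen={value['sen']:.4f}, mi={value['mi']:.4f}, "
                  f"acc={value['perf']:.4f}")
        else:
            print(f"seed {seed_id} {key}={value:.4f}")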
After obtaining the correlation between the metric scores and performance on the dev set, we tune the alpha that maximizes the correlation (or another metric, e.g., NDCG). We then fix alpha and run on the large test set.
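As an illustration, the tuning step could look like the sketch below, which combines the two metric scores with a weight alpha and keeps the alpha with the highest Spearman correlation on the dev set. The combined score alpha * mi - (1 - alpha) * sen is an assumption made for illustration, not the paper's exact formula:

import numpy as np
from scipy.stats import spearmanr

def tune_alpha(sen, mi, perf, num_steps=101):
    """Pick the alpha whose combined score correlates best with dev-set accuracy.

    sen, mi, perf: arrays with one entry per prompt (dev-set values).
    The combination below is an illustrative assumption.
    """
    sen, mi = np.asarray(sen), np.asarray(mi)
    best_alpha, best_corr = 0.0, -1.0
    for alpha in np.linspace(0.0, 1.0, num_steps):
        # Higher mutual information is better; higher sensitivity is worse.
        score = alpha * mi - (1 - alpha) * sen
        corr, _ = spearmanr(score, perf)
        if corr > best_corr:
            best_alpha, best_corr = alpha, corr
    return best_alpha, best_corr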
To use your own custom prompts, edit promptset in main.py.
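For instance, if promptset holds a collection of template strings (a hypothetical structure; check main.py for the actual format), a custom set might look like:

promptset = [
    # Hypothetical prompt templates; the actual format in main.py may differ.
    "Article: {text}\nTopic: {label}",
    "News: {text}\nCategory: {label}",
]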
If you have any questions, suggestions, or concerns, please reach out to us.