😐😨EmotionBench😠😭

RESEARCH USE ONLY✅ NO COMMERCIAL USE ALLOWED❌

Benchmarking the empathy of LLMs.

🛠️ Usage

✨An example run:

python run_emotionbench.py \
  --model gpt-3.5-turbo \
  --questionnaire PANAS \
  --emotion ALL \
  --select-count 5 \
  --default-shuffle-count 2 \
  --emotion-shuffle-count 1 \
  --test-count 1

✨An example result of overall analysis:

Emotions	Positive Affect	Negative Affect	N
Default	43.3 $\pm$ 2.5	25.3 $\pm$ 0.6	3
Anger	$\downarrow$ (-18.8)	$-$ (-0.3)	2
Anxiety	$\downarrow$ (-11.3)	$\downarrow$ (-3.8)	2
Overall	$\downarrow$ (-15.1)	$-$ (-2.1)	4

✨An example result of specific emotion analysis:

Factors	Positive Affect	Negative Affect	N
Default	43.3 $\pm$ 2.5	25.3 $\pm$ 0.6	3
Facing Self-Opinioned People	$\downarrow$ (-18.8)	$-$ (-0.3)	2
Overall	$\downarrow$ (-18.8)	$-$ (-0.3)	2

🔧 Argument Specification

--model: (Required) The name of the model to test.
--questionnaire: (Required) Select the questionnaire(s) to run. For choices please see the list below.
--emotion: (Required) Select the emotion(s) to run. For choices please see the list below.
--select-count: (Required) Number of situations to select per factor. Defaults to 999 (select all situations).
--default-shuffle-count: (Required) Number of different question orders in Default Emotion Measures. If set to zero, only the original order is run. If set to n > 0, the original order is run along with n permutations of it. Defaults to zero.
--emotion-shuffle-count: (Required) Number of different question orders in Evoked Emotion Measures. If set to zero, only the original order is run. If set to n > 0, the original order is run along with n permutations of it. Defaults to zero.
--test-count: (Required) Number of runs for each order. Defaults to one.
--name-exp: Name of this run, used to name the result files.
--significance-level: The significance level for testing the difference of means between human and LLM. Defaults to 0.01.
--mode: For debugging. Selects which part of the code runs.
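
✨A minimal sketch of how the shuffle and test counts combine, going only by the argument descriptions above. The helper name runs_per_measure is hypothetical and is not part of the tool; the formula is an assumption derived from "the original order plus n permutations, each repeated --test-count times".

```python
def runs_per_measure(shuffle_count: int, test_count: int) -> int:
    """Assumed number of questionnaire runs for one measure.

    The original order plus `shuffle_count` permutations gives
    shuffle_count + 1 orders; each order is run `test_count` times.
    """
    if shuffle_count < 0 or test_count < 1:
        raise ValueError("shuffle_count must be >= 0 and test_count >= 1")
    return (shuffle_count + 1) * test_count


# Values taken from the example run above.
print(runs_per_measure(shuffle_count=2, test_count=1))  # 3
print(runs_per_measure(shuffle_count=1, test_count=1))  # 2
```

With the example run's settings this gives 3 for Default Emotion Measures, which agrees with the N column of the Default row in the example output.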

Arguments related to the OpenAI API (can be omitted when you use a custom model):

--openai-organization: Your organization ID. Can be found in Manage account -> Settings -> Organization ID.
--openai-key: Your API key. Can be found in View API keys -> API keys.

🔨 Emotion Selection

Supported emotions: Anger, Anxiety, Depression, Frustration, Jealousy, Guilt, Fear, Embarrassment

To customize the situations (or add more), simply edit situations.csv.

✨An example of situations.csv:

Anger-0	Anger-1	$\cdots$	Anxiety-0	Anxiety-1	$\cdots$
Facing Self-Opinioned People	Blaming, Slandering, and Tattling	$\cdots$	External Factors	Self-Imposed Pressure	$\cdots$
When you ...	When your ...	$\cdots$	You are ...	You have ...	$\cdots$
$\vdots$	$\vdots$	$\ddots$	$\vdots$	$\vdots$	$\ddots$
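
✨A rough sketch of reading a file laid out like the example above, useful for checking your edits. It assumes a comma-separated file whose header cells are named Emotion-index, whose second row holds the factor names, and whose remaining rows hold the situations. The function load_situations and the SAMPLE text are illustrative only and are not part of the tool.

```python
import csv
import io
from collections import defaultdict

# Tiny stand-in for situations.csv, mirroring the example layout above.
SAMPLE = """Anger-0,Anger-1,Anxiety-0
Facing Self-Opinioned People,"Blaming, Slandering, and Tattling",External Factors
When you ...,When your ...,You are ...
"""


def load_situations(handle):
    """Return {emotion: {factor: [situation, ...]}} from an open CSV handle."""
    rows = list(csv.reader(handle))
    header, factors, situations = rows[0], rows[1], rows[2:]
    result = defaultdict(dict)
    for col, name in enumerate(header):
        # "Anger-0" -> emotion "Anger"; the numeric suffix only orders factors.
        emotion, _, _index = name.rpartition("-")
        texts = [
            row[col]
            for row in situations
            if col < len(row) and row[col].strip()  # skip empty cells
        ]
        result[emotion][factors[col]] = texts
    return dict(result)


if __name__ == "__main__":
    for emotion, by_factor in load_situations(io.StringIO(SAMPLE)).items():
        for factor, texts in by_factor.items():
            print(f"{emotion} / {factor}: {len(texts)} situation(s)")
```

To inspect a real file, pass open("situations.csv", newline="", encoding="utf-8") instead of the in-memory sample.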

📃 Questionnaire List

Positive And Negative Affect Schedule: --questionnaire PANAS (--emotion ALL)
Aggression Questionnaire: --questionnaire AGQ (--emotion Anger)
Short-form Depression Anxiety Stress Scales: --questionnaire DASS-21 (--emotion Anxiety)
Beck Depression Inventory: --questionnaire BDI (--emotion Depression)
Frustration Discomfort Scale: --questionnaire FDS (--emotion Frustration)
Multidimensional Jealousy Scale: --questionnaire MJS (--emotion Jealousy)
Guilt And Shame Proneness: --questionnaire GASP (--emotion Guilt)
Fear Survey Schedule: --questionnaire FSS (--emotion Fear)
Brief Fear of Negative Evaluation: --questionnaire BFNE (--emotion Embarrassment)
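
✨A small sketch that pairs each questionnaire with its emotion, as listed above, and builds the matching command line. The helper build_command is hypothetical; the count flags simply reuse the values from the example run and should be adjusted to your needs.

```python
import shlex

# Questionnaire -> emotion pairs, copied from the list above.
QUESTIONNAIRE_EMOTION = {
    "PANAS": "ALL",
    "AGQ": "Anger",
    "DASS-21": "Anxiety",
    "BDI": "Depression",
    "FDS": "Frustration",
    "MJS": "Jealousy",
    "GASP": "Guilt",
    "FSS": "Fear",
    "BFNE": "Embarrassment",
}


def build_command(model: str, questionnaire: str) -> list[str]:
    """Build an argument list for run_emotionbench.py (not executed here)."""
    emotion = QUESTIONNAIRE_EMOTION[questionnaire]
    return [
        "python", "run_emotionbench.py",
        "--model", model,
        "--questionnaire", questionnaire,
        "--emotion", emotion,
        # Count values below are taken from the example run.
        "--select-count", "5",
        "--default-shuffle-count", "2",
        "--emotion-shuffle-count", "1",
        "--test-count", "1",
    ]


if __name__ == "__main__":
    for name in QUESTIONNAIRE_EMOTION:
        print(shlex.join(build_command("gpt-3.5-turbo", name)))
```

Each printed line can be pasted into a shell, or the list can be handed to subprocess.run to launch the runs one after another.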

🚀 Benchmarking Your Own Model

It is easy! Just replace the example_generator function that is passed to run_psychobench(args, generator).

Your custom function your_generator() should do the following:

Read questions from the file args.testing_file. The file is located under results/ (check run_psychobench() in utils.py) and has the following format:

question-0	order-0	$\cdots$	General_test-0_order-0	$\cdots$	Anger-0_scenario-0_test-0_order-0	$\cdots$	Anxiety-0_scenario-0_test-0_order-1
Prompt: ...	Prompt: ...	$\cdots$		$\cdots$	Imagine...	$\cdots$	Imagine...
1. Q1	1	$\cdots$	4	$\cdots$	3	$\cdots$	3
2. Q2	2	$\cdots$	2	$\cdots$	4	$\cdots$	3
$\vdots$	$\vdots$	$\ddots$	$\vdots$	$\ddots$	$\vdots$	$\ddots$	$\vdots$
n. Qn	n	$\cdots$	3	$\cdots$	3	$\cdots$	1

The column immediately before each column whose name starts with order- contains the shuffled questions to use as your input.

Call your own LLM and get the results.
Fill in the blank cells in the file args.testing_file. There is no need to map the responses back to their original order; our code takes care of it.

Please check example_generator.py for detailed information.
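
✨The three steps above can be sketched roughly as follows. This is not the code in example_generator.py; it is an illustration under several assumptions: the testing file is comma-separated, the first row after the header holds the prompt (or the imagined situation for evoked measures), and each answer column ends with _order-k, matching the question column that sits just before order-k. The names fill_testing_file and ask_model are hypothetical; ask_model stands in for the call to your own LLM and must return one answer per question.

```python
import csv
import re


def fill_testing_file(path, ask_model):
    """Fill blank answer columns of a testing file in place (assumed layout)."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))
    header, first_row, body = rows[0], rows[1], rows[2:]

    # Step 1: locate the shuffled questions. For every "order-k" column,
    # the column right before it holds the questions in that order.
    question_col = {}
    for i, name in enumerate(header):
        match = re.fullmatch(r"order-(\d+)", name)
        if match and i > 0:
            question_col[match.group(1)] = i - 1

    for j, name in enumerate(header):
        match = re.search(r"_order-(\d+)$", name)
        if not match or match.group(1) not in question_col:
            continue  # not an answer column
        if all(row[j].strip() for row in body):
            continue  # already answered
        q = question_col[match.group(1)]
        prompt = first_row[q]      # questionnaire instructions
        situation = first_row[j]   # empty for default measures
        questions = [row[q] for row in body]

        # Step 2: call your own model (hypothetical callback).
        answers = ask_model(prompt, situation, questions)

        # Step 3: write answers in the shuffled order, without remapping.
        for row, answer in zip(body, answers):
            row[j] = str(answer)

    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
```

A call such as fill_testing_file(args.testing_file, my_model_fn) would then sit inside your_generator(), where my_model_fn wraps your model and parses its reply into a list of scores.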

👉 Paper and Citation

For more details, please refer to our paper (cited below).

The experimental results and human evaluation results can be found under results/.

If you find our paper and tool interesting and useful, please feel free to give us a star and cite us through:

@inproceedings{huang2024apathetic,
  author    = {Jen{-}tse Huang and
               Man Ho Lam and
               Eric John Li and
               Shujie Ren and
               Wenxuan Wang and
               Wenxiang Jiao and
               Zhaopeng Tu and
               Michael R. Lyu},
  title     = {Apathetic or Empathetic? Evaluating {LLM}s' Emotional Alignments with Humans},
  booktitle = {Advances in Neural Information Processing Systems 37},
  year      = {2024}
}