FLUB

When LLMs Meet Cunning Texts: A Fallacy Understanding Benchmark for Large Language Models

Paper | Webpage

Preliminaries

We provide detailed supplementary materials, including the Technical Appendices, Datasheet, Metadata, and Author Statement.

Requirements

  • transformers ~= 4.35.0
  • vllm ~= 0.2.2
  • openai == 0.28.0
  • scikit-learn
  • pandas
  • numpy
  • tqdm
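
To check that the pinned versions are in place before running anything, a quick Python sanity check like the following can help (a minimal sketch; the package list simply mirrors the requirements above):

# Print the installed version of each required package.
from importlib.metadata import version, PackageNotFoundError

for pkg in ["transformers", "vllm", "openai", "scikit-learn", "pandas", "numpy", "tqdm"]:
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")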

How to run

Step 1: Model Inference

For local models (Qwen-72B-Chat, Yi-34B-Chat, Baichuan2-13B-Chat, etc.), here is an example:

python tasks.py \
    --model_name Qwen-72B-Chat \
    --tp_size 8 \
    --gpu_memory_utilization 0.9 \
    --fewshot 0

For API models (GPT-4-Turbo, ERNIE-Bot-4.0, etc.), here is another example:

python tasks.py \
    --model_name gpt-4-1106-preview \
    --is_api \
    --num_processes 32 \
    --fewshot 0
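
Note that for the API models, your API key must be visible to the script. With openai == 0.28.0 the key is set at the module level, typically from an environment variable (a sketch of the usual pattern; how tasks.py actually loads the key may differ):

import os
import openai

# openai 0.28.x reads credentials from the module-level attribute.
openai.api_key = os.environ["OPENAI_API_KEY"]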

Step 2: Automatic Evaluation

Please run

python evaluation.py

Step 3: Compute Metrics

Please run

python analysis.py

All metrics will be saved to metrics.tsv.
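
Since metrics.tsv is tab-separated, it can be inspected with pandas, which is already among the requirements (a sketch; the exact column names depend on analysis.py):

import pandas as pd

# Load the metrics table written by analysis.py.
metrics = pd.read_csv("metrics.tsv", sep="\t")
print(metrics.head())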

Metadata and Data Format

The Croissant metadata of FLUB is available at FLUB_croissant_metadata.

The data format of FLUB is as follows:

{
  "text": "The input cunning text.",
  "is_question": "Whether the input cunning text is a question.",
  "type": "The cunning type of the input text, used for the Cunning Type Classification task.",
  "explanation": "The correct explanation of the input text, used for the Fallacy Explanation task.",
  "id": "The ID of the data sample.",
  "options": {
    "A": "Candidate answer A for the input text (question).",
    "B": "Candidate answer B for the input text (question).",
    "C": "Candidate answer C for the input text (question).",
    "D": "Candidate answer D for the input text (question)."
  },
  "answer": "The correct answer for the Answer Selection (Multiple Choice) task."
}
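
For reference, here is how a sample in this format can be loaded and used (a minimal sketch, assuming one JSON object per line; the file name FLUB.jsonl is a placeholder for the actual data file in this repository):

import json

# Read FLUB samples from a JSON-lines file (placeholder path).
with open("FLUB.jsonl", encoding="utf-8") as f:
    samples = [json.loads(line) for line in f]

sample = samples[0]
print(sample["text"])  # the cunning text
# If "answer" stores the option key (e.g. "A"), look up the gold option:
print(sample["options"][sample["answer"]])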

Citation

Please consider citing this paper if you use the code or data from our work. Thanks a lot :)

@article{li2024llms,
  title={When {LLMs} Meet Cunning Texts: A Fallacy Understanding Benchmark for Large Language Models},
  author={Li, Yinghui and Zhou, Qingyu and Luo, Yuanzhen and Ma, Shirong and Li, Yangning and Zheng, Hai-Tao and Hu, Xuming and Yu, Philip S.},
  journal={arXiv preprint arXiv:2402.11100},
  year={2024}
}

License

Both the data and the code are released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.