- Paper: InFoBench: Evaluating Instruction Following Ability in Large Language Models
- Dataset: InFoBench Dataset
- Generation and Annotation: InFoBench Generation and Annotation
```bibtex
@article{qin2024infobench,
  title={InFoBench: Evaluating Instruction Following Ability in Large Language Models},
  author={Yiwei Qin and Kaiqiang Song and Yebowen Hu and Wenlin Yao and Sangwoo Cho and Xiaoyang Wang and Xuansheng Wu and Fei Liu and Pengfei Liu and Dong Yu},
  year={2024},
  eprint={2401.03601},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```
You can download it directly with the Hugging Face `datasets` library:
```python
from datasets import load_dataset

dataset = load_dataset("kqsong/InFoBench")
```
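For a quick sanity check, the minimal sketch below prints the available splits and fields; the split name and field names are read from the loaded dataset at runtime rather than assumed.

```python
from datasets import load_dataset

dataset = load_dataset("kqsong/InFoBench")
split = list(dataset.keys())[0]  # use whichever split is present
print(f"Split '{split}' with {len(dataset[split])} entries")
print("Fields:", list(dataset[split][0].keys()))
```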
Provide an output file at `model/output.json`. Each data entry should be a JSON object on its own line, containing all the fields from the input format. The generated response should be included in the JSON object under a new field named `output`.
We suggest using greedy decoding to avoid randomness from sampling, as in the sketch below.
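A minimal sketch of producing `model/output.json` in this format follows; `generate_response` is a hypothetical placeholder for your model's inference call, and the `instruction`/`input` field names are assumptions to be adjusted to the actual dataset schema.

```python
import json
import os

from datasets import load_dataset

def generate_response(instruction: str, context: str) -> str:
    # Hypothetical placeholder: replace with your model's inference call.
    # Use greedy decoding (e.g. temperature=0 / do_sample=False) to avoid
    # sampling randomness.
    raise NotImplementedError

dataset = load_dataset("kqsong/InFoBench")
split = list(dataset.keys())[0]

os.makedirs("model", exist_ok=True)
with open("model/output.json", "w", encoding="utf-8") as f:
    for entry in dataset[split]:
        record = dict(entry)  # keep every field from the input format
        record["output"] = generate_response(record.get("instruction", ""),
                                             record.get("input", ""))
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```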
Evaluate the LLM's outputs against the decomposed questions. This research uses GPT-4-0314 as the evaluator by default.
```shell
python evaluation.py \
    --api_key <OPENAI KEY> \
    --eval_model gpt-4-0314 \
    --input model/output.json \
    --output_dir evaluation/ \
    --temperature 0
```
Each data entry will include an `eval` key of type `List[bool]`, representing the "Yes" or "No" answer to each decomposed question. The final evaluation file will be saved in JSON format under `<output_dir>/<eval_model>/`.
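To turn the per-question verdicts into a single score, the sketch below computes the fraction of "Yes" answers across all decomposed questions; the exact file name under `<output_dir>/<eval_model>/` is an assumption, so point `eval_path` at the file that `evaluation.py` actually writes.

```python
import json

# Hypothetical file name; adjust to the actual evaluation output file.
eval_path = "evaluation/gpt-4-0314/output.json"

with open(eval_path, encoding="utf-8") as f:
    text = f.read()
try:
    entries = json.loads(text)  # a single JSON array
except json.JSONDecodeError:
    # fall back to one JSON object per line
    entries = [json.loads(line) for line in text.splitlines() if line.strip()]

total = yes = 0
for entry in entries:
    verdicts = entry["eval"]  # List[bool], one per decomposed question
    total += len(verdicts)
    yes += sum(verdicts)

print(f"Fraction of decomposed requirements followed: {yes / total:.3f}")
```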