Code and data for "KoDialogBench: Evaluating Conversational Understanding of Language Models with Korean Dialogue Benchmark" (LREC-COLING 2024)

KoDialogBench

This is the official repository for "KoDialogBench: Evaluating Conversational Understanding of Language Models with Korean Dialogue Benchmark" accepted at LREC-COLING 2024.

Data

KoDialogBench is a benchmark designed to assess the conversational capabilities of language models in Korean. To this end, we collected native Korean dialogues on daily topics from public sources (e.g., AI Hub) and translated dialogues from other languages such as English and Chinese. We then structured these conversations into diverse test sets, spanning dialogue comprehension and response selection tasks. The benchmark consists of 21 test sets covering various aspects of open-domain colloquial dialogue (e.g., topic, emotion, dialog act).

We uploaded the datasets to the 🤗 Hugging Face Hub.

Sources

We collected native Korean dialogues from AI Hub:

  • K-SNS stands for Korean SNS (한국어 SNS)
  • K-TDD stands for Thematic Daily Dialogues (주제별 텍스트 일상 대화 데이터)
  • K-ED stands for Emotional Dialogues (감성 대화 말뭉치)
  • K-DS stands for Dialogue Summary (한국어 대화 요약)

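We translated public datasets from other languages (the non-AI Hub sources in the statistics table):

  • DailyDialog
  • Empathetic Dialogues
  • PersonaChat
  • SocialDial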

Statistics

The dataset has 82,962 examples in total.

| Task | Subtask | Source | # Options | # Examples |
|---|---|---|---|---|
| Dialogue Comprehension | Topic Classification | K-SNS | 6 | 1200 |
| Dialogue Comprehension | Topic Classification | K-TDD | 19 | 1900 |
| Dialogue Comprehension | Topic Classification | SocialDial | 4 | 400 |
| Dialogue Comprehension | Emotion Recognition | K-ED | 6 | 1200 |
| Dialogue Comprehension | Emotion Recognition | DailyDialog | 5 | 470 |
| Dialogue Comprehension | Emotion Recognition | Empathetic Dialogues | 2 | 2000 |
| Dialogue Comprehension | Relation Classification | SocialDial (Distance) | 4 | 524 |
| Dialogue Comprehension | Relation Classification | SocialDial (Relation) | 3 | 330 |
| Dialogue Comprehension | Location Classification | SocialDial | 4 | 376 |
| Dialogue Comprehension | Dialog Act Classification | K-TDD | 4 | 520 |
| Dialogue Comprehension | Dialog Act Classification | DailyDialog | 4 | 1000 |
| Dialogue Comprehension | Fact Identification | K-DS | 4 | 1200 |
| Dialogue Comprehension | Fact Identification | PersonaChat | 4 | 1000 |
| Dialogue Comprehension | Fact Identification | Empathetic Dialogues | 4 | 2394 |
| Response Selection | | K-SNS | 5 | 10295 |
| Response Selection | | K-TDD | 5 | 10616 |
| Response Selection | | K-ED | 5 | 17818 |
| Response Selection | | PersonaChat | 5 | 7801 |
| Response Selection | | DailyDialog | 5 | 6740 |
| Response Selection | | Empathetic Dialogues | 5 | 7941 |
| Response Selection | | SocialDial | 5 | 7237 |
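As a quick sanity check on the statistics, the sketch below re-enters the table as Python tuples and recomputes the per-task and overall example counts. It also derives a uniform random-guess accuracy (1 / # options) for each test set, which is our own illustrative baseline and not a number reported by the authors. The names `ROWS` and `chance_accuracy` are hypothetical.

```python
from collections import defaultdict

# (task, subtask, source, n_options, n_examples), copied from the statistics table.
ROWS = [
    ("Dialogue Comprehension", "Topic Classification", "K-SNS", 6, 1200),
    ("Dialogue Comprehension", "Topic Classification", "K-TDD", 19, 1900),
    ("Dialogue Comprehension", "Topic Classification", "SocialDial", 4, 400),
    ("Dialogue Comprehension", "Emotion Recognition", "K-ED", 6, 1200),
    ("Dialogue Comprehension", "Emotion Recognition", "DailyDialog", 5, 470),
    ("Dialogue Comprehension", "Emotion Recognition", "Empathetic Dialogues", 2, 2000),
    ("Dialogue Comprehension", "Relation Classification", "SocialDial (Distance)", 4, 524),
    ("Dialogue Comprehension", "Relation Classification", "SocialDial (Relation)", 3, 330),
    ("Dialogue Comprehension", "Location Classification", "SocialDial", 4, 376),
    ("Dialogue Comprehension", "Dialog Act Classification", "K-TDD", 4, 520),
    ("Dialogue Comprehension", "Dialog Act Classification", "DailyDialog", 4, 1000),
    ("Dialogue Comprehension", "Fact Identification", "K-DS", 4, 1200),
    ("Dialogue Comprehension", "Fact Identification", "PersonaChat", 4, 1000),
    ("Dialogue Comprehension", "Fact Identification", "Empathetic Dialogues", 4, 2394),
    ("Response Selection", "", "K-SNS", 5, 10295),
    ("Response Selection", "", "K-TDD", 5, 10616),
    ("Response Selection", "", "K-ED", 5, 17818),
    ("Response Selection", "", "PersonaChat", 5, 7801),
    ("Response Selection", "", "DailyDialog", 5, 6740),
    ("Response Selection", "", "Empathetic Dialogues", 5, 7941),
    ("Response Selection", "", "SocialDial", 5, 7237),
]

# Sum the example counts per top-level task and overall.
examples_per_task = defaultdict(int)
for task, _subtask, _source, _n_options, n_examples in ROWS:
    examples_per_task[task] += n_examples
total_examples = sum(examples_per_task.values())


def chance_accuracy(n_options: int) -> float:
    """Accuracy of a uniform random guess over n_options choices (illustrative baseline)."""
    return 1.0 / n_options


print(len(ROWS), "test sets,", total_examples, "examples")  # 21 test sets, 82962 examples
for task, count in examples_per_task.items():
    print(f"{task}: {count}")
```

The recomputed total matches the 82,962 examples stated above, with 14,514 in dialogue comprehension and 68,448 in response selection.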

Usage

lm-evaluation-harness is used for zero-shot and few-shot evaluation.

TODO: merge the KoDialogBench task to lm-evaluation-harness

Installation

Install lm-eval before cloning this repo.

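git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
# Create and activate a virtual environment so the packages install into it
python -m venv venv
source venv/bin/activate
pip install -e .
pip install -e ".[multilingual]"
pip install sentencepiece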

Task registration

After cloning this repo, copy the task configs into the lm-eval task directory:

cp -r kodialogbench ../lm-evaluation-harness/lm_eval/tasks
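For setups where a script is more convenient than a shell command, the copy step can be sketched in Python as below. This is a minimal sketch, not part of the repository: `register_tasks` is a hypothetical helper, and the directory layout simply mirrors the `cp -r` command above. The `__main__` section runs against a throwaway directory tree with a placeholder file name.

```python
import shutil
import tempfile
from pathlib import Path


def register_tasks(task_dir: Path, harness_root: Path) -> Path:
    """Copy task_dir into <harness_root>/lm_eval/tasks, similar to `cp -r`."""
    tasks_root = harness_root / "lm_eval" / "tasks"
    if not tasks_root.is_dir():
        raise FileNotFoundError(f"{tasks_root} not found; is lm-eval cloned there?")
    target = tasks_root / task_dir.name
    # dirs_exist_ok=True (Python 3.8+) lets the copy be re-run after editing configs.
    shutil.copytree(task_dir, target, dirs_exist_ok=True)
    return target


if __name__ == "__main__":
    # Dry run on a temporary tree; "example.yaml" is a placeholder, not a real config.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "lm-evaluation-harness" / "lm_eval" / "tasks").mkdir(parents=True)
        (root / "kodialogbench").mkdir()
        (root / "kodialogbench" / "example.yaml").write_text("task: example\n")
        copied = register_tasks(root / "kodialogbench", root / "lm-evaluation-harness")
        print(sorted(p.name for p in copied.iterdir()))
```

Failing early when `lm_eval/tasks` is missing avoids silently creating a task folder in the wrong place.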

Evaluation

You can evaluate the subsets using the following arguments to --tasks:

  • kodialogbench_dc: 14 dialogue comprehension tasks
  • kodialogbench_rs: 7 response selection tasks
  • kodialogbench_dc_topic: 3 topic classification tasks
  • kodialogbench_dc_emotion: 3 emotion recognition tasks
  • kodialogbench_dc_relation: 2 relation classification tasks
  • kodialogbench_dc_dialog_act: 2 dialog act classification tasks
  • kodialogbench_dc_fact: 3 fact identification tasks

lm_eval --model hf \
    --model_args pretrained=EleutherAI/polyglot-ko-1.3b \
    --tasks kodialogbench \
    --device cuda:0 \
    --batch_size auto \
    --num_fewshot 0
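To sweep over several subsets or few-shot settings, the command above can be assembled programmatically. The sketch below only builds the argument list and does not launch anything. `build_lm_eval_command` and `KNOWN_GROUPS` are hypothetical names; the flags and group names are the ones shown in this README, and we assume the usual lm-eval behavior of accepting a comma-separated list for `--tasks`.

```python
import shlex

# Group names taken from this README; "kodialogbench" is the one used in the example command.
KNOWN_GROUPS = {
    "kodialogbench",
    "kodialogbench_dc",
    "kodialogbench_rs",
    "kodialogbench_dc_topic",
    "kodialogbench_dc_emotion",
    "kodialogbench_dc_relation",
    "kodialogbench_dc_dialog_act",
    "kodialogbench_dc_fact",
}


def build_lm_eval_command(
    tasks,
    pretrained="EleutherAI/polyglot-ko-1.3b",
    device="cuda:0",
    batch_size="auto",
    num_fewshot=0,
):
    """Return the lm_eval invocation as a list of arguments (nothing is executed)."""
    unknown = [t for t in tasks if t not in KNOWN_GROUPS]
    if unknown:
        raise ValueError(f"Unknown task group(s): {unknown}")
    return [
        "lm_eval",
        "--model", "hf",
        "--model_args", f"pretrained={pretrained}",
        "--tasks", ",".join(tasks),
        "--device", device,
        "--batch_size", str(batch_size),
        "--num_fewshot", str(num_fewshot),
    ]


if __name__ == "__main__":
    # Print a 5-shot command for two subsets; pass the list to subprocess.run to execute it.
    cmd = build_lm_eval_command(["kodialogbench_dc_topic", "kodialogbench_rs"], num_fewshot=5)
    print(shlex.join(cmd))
```

Keeping the command as a list rather than a single string avoids shell-quoting issues if it is later passed to `subprocess.run`.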

To change the prompts, modify the doc_to_text functions in utils.py.
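For orientation, a prompt function of this kind takes one dataset example (a dict) and returns the prompt string. The sketch below is illustrative only: the field names `dialogue` and `options` and the prompt wording are assumptions, not the repository's actual schema or prompts, so check utils.py and the dataset columns before adapting it.

```python
def doc_to_text(doc: dict) -> str:
    """Format one example as a multiple-choice prompt.

    Assumes hypothetical fields: doc["dialogue"] is a list of utterance strings
    and doc["options"] is a list of candidate answers.
    """
    turns = "\n".join(doc["dialogue"])
    # Number the candidate answers starting from 1.
    choices = "\n".join(f"{i}. {opt}" for i, opt in enumerate(doc["options"], start=1))
    return f"Dialogue:\n{turns}\n\nOptions:\n{choices}\n\nAnswer:"


if __name__ == "__main__":
    example = {
        "dialogue": ["A: 오늘 날씨 어때?", "B: 비가 올 것 같아."],
        "options": ["weather", "food", "travel", "work"],
    }
    print(doc_to_text(example))
```

Prompt wording can shift zero-shot and few-shot scores noticeably, so results obtained with modified prompts are best reported separately from those using the default ones.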

Limitations

Our benchmark may suffer from a chronic problem of benchmark contamination. Due to the scarcity of Korean language resources, there is a possibility that the held-out sources utilized to construct the benchmark might overlap with training data used for some language models.

Ethics Statement

Our benchmark dataset is designed to assess capabilities related to various situations and aspects of conversations in Korean. To achieve this, we used conversational content from publicly available datasets from various sources, either without modification or with translation where necessary. During this process, harmful content or inappropriate biases present in the original data may have been carried over, or may have been introduced by the limitations of translation tools. We reject any form of violence, discrimination, or offensive language, and our benchmark dataset and experimental results do not represent such values. If any harmful content or privacy infringement is identified within the dataset, we kindly request immediate notification to the authors. If such cases are reported, we will apply the highest ethical standards and take appropriate action.

Citation

@misc{jang2024kodialogbench,
      title={KoDialogBench: Evaluating Conversational Understanding of Language Models with Korean Dialogue Benchmark}, 
      author={Seongbo Jang and Seonghyeon Lee and Hwanjo Yu},
      year={2024},
      eprint={2402.17377},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Point of Contact

Seongbo Jang