
🌠 KoCommonGEN v2

KoCommonGEN v2: A Benchmark for Navigating Korean Commonsense Reasoning Challenges in Large Language Models [ACL 2024-Findings]

Jaehyung Seo, Jaewook Lee, Chanjun Park, SeongTae Hong, Seungjun Lee and Heuiseok Lim

🏫 NLP & AI Lab, Korea University


πŸ”₯ News

  • September 27, 2023: Provided data support for the Open Ko-LLM Leaderboard
  • August 7, 2024: Released the dataset
  • August 10, 2024: Added experimental results for newly released models
  • August 14, 2024: Presented the paper at ACL 2024

πŸ“Š Dataset

The KoCommonGEN v2 dataset is available on Hugging Face:

You can easily access and use these datasets for your research and experiments.
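As a quick illustration, the dataset can be loaded with the Hugging Face `datasets` library. The dataset ID below is an assumption (inferred from the code-switching dataset ID later in this README); verify it against the Hugging Face link above.

```python
# Illustrative sketch: loading KoCommonGEN v2 with the `datasets` library.
# "nlpai-lab/ko_commongen_v2" is an assumed dataset ID -- check the
# Hugging Face link above for the exact ID before use.
from datasets import load_dataset

dataset = load_dataset("nlpai-lab/ko_commongen_v2")
print(dataset)
```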

πŸ› οΈ Installation

This repository partially adopts the evaluation methods of version 0.3.0 of EleutherAI/lm-evaluation-harness for the evaluation of KoCommonGEN v2.

$ git clone https://github.com/J-Seo/KoCommonGEN-V2.git
# requires Python >= 3.9
$ cd KoCommonGEN-V2
$ pip install -r requirements.txt

πŸš€ Usage

Up to 5 few-shot examples are currently provided, so `--num_fewshot` can be set to at most 5; users can freely add more examples to raise this limit.

$ sh test.sh
## test.sh
python3 main.py \
--model hf-causal-experimental \
--model_args pretrained="nlpai-lab/KULLM3" \
--task ko_commongen_v2 \
--device cuda:1 \
--num_fewshot 2 \
--batch_size 1 \
--output nlpai-lab/KULLM3 &

You can also use sequence-to-sequence models.

## test.sh
python3 main.py \
--model hf-seq2seq \
--model_args pretrained="google/flan-t5-xxl" \
--task ko_commongen_v2 \
--device cuda:1 \
--num_fewshot 2 \
--batch_size 1 \
--output google/flan-t5-xxl &

πŸ‘₯ Human Evaluation

We recruited 22 native Korean-speaking volunteers as human evaluators and paid them $0.80 per question.

| Model | # Evaluators | Average Score | Cohen's kappa | Krippendorff's alpha |
|-------|--------------|---------------|---------------|----------------------|
| Human | 22           | 0.8395        | 0.7693        | 0.7706               |
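For reference, pairwise agreement scores such as Cohen's kappa can be computed as in the minimal sketch below. This is not the authors' evaluation code, and the labels are toy data.

```python
# Minimal sketch (not the authors' evaluation code): pairwise
# inter-annotator agreement via Cohen's kappa, using scikit-learn.
from sklearn.metrics import cohen_kappa_score

annotator_a = [1, 0, 2, 1, 3]  # toy answer choices from one evaluator
annotator_b = [1, 0, 2, 2, 3]  # toy answer choices from another evaluator

print(cohen_kappa_score(annotator_a, annotator_b))

# Krippendorff's alpha (reported above) generalizes agreement to many
# annotators; the third-party `krippendorff` package implements it.
```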

πŸ€– Models (August 10, 2024)

Results of the 2-shot evaluation of newly released models:

| Model | Size | Acc_norm | Stderr | Link |
|-------|------|----------|--------|------|
| GPT-4 (June 13, 2023) | - | 0.7450 | - | |
| Mistral-Nemo-Instruct | 12B | 0.6612 | 0.0163 | πŸ”— |
| Mistral-Nemo-Base | 12B | 0.6340 | 0.0166 | πŸ”— |
| Meta-Llama-3.1-8B | 8B | 0.6246 | 0.0166 | πŸ”— |
| QWEN2-7B base | 7B | 0.6187 | 0.0167 | πŸ”— |
| EXAONE-3.0-7.8B-Instruct | 7.8B | 0.6088 | 0.0168 | πŸ”— |
| MLP-KTLim-Bllossom-8B | 8B | 0.6057 | 0.0168 | πŸ”— |
| Meta-Llama-3.1-8B-Instruct | 8B | 0.6057 | 0.0168 | πŸ”— |
| KULLM3 | 10.8B | 0.6033 | 0.0168 | πŸ”— |
| QWEN2-7B inst | 7B | 0.5832 | 0.0170 | πŸ”— |
| Gemma-2-9b-it | 9B | 0.5714 | 0.0170 | πŸ”— |
| Aya-23-8B | 8B | 0.5159 | 0.0172 | πŸ”— |
| Allganize-Alpha-Instruct | 8B | 0.4970 | 0.0172 | πŸ”— |

As mentioned in the paper, it is possible to evaluate various models.

πŸ‡°πŸ‡·πŸ‡ΊπŸ‡ΈπŸ‡―πŸ‡΅πŸ‡¨πŸ‡³πŸ‡ͺπŸ‡Έ Code-switching

The multilingual dataset consists of 99 samples for numerical commonsense reasoning, created with the help of machine translation.

The dataset can be found at the following path: lm_eval/datasets/ko_commongen_v2/shuffled_$LANG$_1.0.jsonl.

You can also access the code-switching dataset on Hugging Face: nlpai-lab/ko_commongen_v2_code_switching

(The code-switching data relies on machine translation, which may result in some inaccuracies.)

If you intend to use it for evaluation, you should modify the prompt and file path in lm_eval/tasks/ko_commongen_v2.py.
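For instance, a split can be read as in the sketch below; `load_code_switching_split` is a hypothetical helper (not part of this repository), and the language code is an assumption.

```python
# Hypothetical helper (not part of this repo): read one code-switching
# split from the path pattern shown above.
import json

def load_code_switching_split(lang: str):
    path = f"lm_eval/datasets/ko_commongen_v2/shuffled_{lang}_1.0.jsonl"
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

samples = load_code_switching_split("en")  # "en" is an assumed language code
print(len(samples))
```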

πŸ“– Citation

@inproceedings{seo2024kocommongen,
  title={KoCommonGEN v2: A Benchmark for Navigating Korean Commonsense Reasoning Challenges in Large Language Models},
  author={Seo, Jaehyung and Lee, Jaewook and Park, Chanjun and Hong, SeongTae and Lee, Seungjun and Lim, Heui-Seok},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2024},
  pages={2390--2415},
  year={2024}
}

🚨 Warning!

This dataset contains some instances of toxic speech.

πŸ™ Acknowledgement

We sincerely appreciate the dedication of Chanjun Park, Sanghoon Kim and Sunghun Kim (Sung Kim) from Upstage AI in managing one of the benchmark datasets for the Open Ko-LLM Leaderboard.