KoMT-Bench

Official repository for KoMT-Bench, built by LG AI Research.

| 🤗 HuggingFace | 📑 EXAONE 3.0 7.8B Tech Report |


Introduction

This is the official repository for KoMT-Bench, built by LG AI Research and used to evaluate the Korean instruction-following capability of language models, as described in the technical report “EXAONE 3.0 7.8B Instruction-Tuned Language Model”. KoMT-Bench was developed by translating the MT-Bench [1] dataset into Korean and modifying some questions to reflect the characteristics and cultural nuances of the Korean language.

All source code in this repository is based on LMSYS’s FastChat repository, which we have adapted to support the EXAONE 3.0 7.8B model.


Here are examples from KoMT-Bench:


| Category | Turn | MT-Bench | KoMT-Bench (translated back into English) |
|---|---|---|---|
| Writing | 1st | Imagine you are writing a blog post comparing two popular smartphone models. Develop an outline for the blog post, including key points and subheadings to effectively compare and contrast the features, performance, and user experience of the two models. Please answer in fewer than 200 words. | Imagine you are writing a blog post comparing two popular smartphone models. Develop an outline for the blog post, including key points and subheadings to effectively compare and contrast the features, performance, and user experience of the two models. Please answer in fewer than 200 characters. |
| Writing | 2nd | Take your previous response and rephrase it as a limerick. | Rewrite your previous response in the Chungcheong dialect. |
| Math | 1st | When a number is divided by 10, the remainder is 4. What is the remainder when twice the number is divided by 4? | When a number is divided by 10, the remainder is 4. Find the remainder when twice the number is divided by 4. |
| Math | 2nd | What about when twice the number is divided by 5? | What about when twice the number is divided by 5? |
| Humanities | 1st | Provide insights into the correlation between economic indicators such as GDP, inflation, and unemployment rates. Explain how fiscal and monetary policies affect those indicators. | Provide insights into the correlation between economic indicators such as GDP, inflation, and unemployment rates. Explain how fiscal and monetary policies affect those indicators. |
| Humanities | 2nd | Now, explain them again like I'm five. | Now, explain them again as if I were five years old. |
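Incidentally, the math example above has a unique answer regardless of which number is chosen; a quick sanity check in Python:

```python
# Any number n with n % 10 == 4 gives the same answer on both turns.
answers_t1 = {(2 * n) % 4 for n in range(4, 200, 10)}  # 1st turn: 2n mod 4
answers_t2 = {(2 * n) % 5 for n in range(4, 200, 10)}  # 2nd turn: 2n mod 5

print(answers_t1)  # {0}: 2n = 20k + 8 is always divisible by 4
print(answers_t2)  # {3}: 2n = 20k + 8 is always 3 mod 5
```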

Setup & Installation

```shell
git clone https://github.com/LG-AI-EXAONE/KoMT-Bench
cd KoMT-Bench
bash setting.sh
export OPENAI_API_KEY=<openai key>
export PYTHONPATH=${PYTHONPATH}:${PWD}
```
  • Note that we apply a square-root penalty to non-Korean responses, which requires detecting the language each response is written in. To accomplish this, we utilize a language detector developed by Google. We recommend that users check that the detector has been installed correctly at the following path: ./fastchat/llm_judge/data/mt_bench/model_judgment/language_detector.tflite
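The shipped detector is a tflite model, but the underlying idea can be sketched with a far simpler heuristic: the fraction of Hangul syllables among a response's letters. This is only an illustrative stand-in, not the detector the repository actually uses, and the 0.5 threshold is an assumption:

```python
def looks_korean(text: str, threshold: float = 0.5) -> bool:
    """Crude stand-in for a language detector: treat a response as Korean
    when at least `threshold` of its alphabetic characters are Hangul."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    hangul = [ch for ch in letters if "\uac00" <= ch <= "\ud7a3"]
    return len(hangul) / len(letters) >= threshold

print(looks_korean("모델의 답변을 평가합니다."))   # True
print(looks_korean("This answer is in English."))  # False
```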

Run

```shell
bash run_ko_mt.sh
```
  • To evaluate the EXAONE model end-to-end, just execute the single command above.

To run each step separately, please follow the steps below:

1. Generating Model Answers

```shell
cd fastchat/llm_judge/

# Generating model answers
CUDA_VISIBLE_DEVICES=0 python gen_model_answer.py \
    --model-path <model_name or model_path> \
    --model-id <model_id> \
    --dtype bfloat16
```
  • To evaluate the EXAONE model, <model_name or model_path> must include the keyword exaone.

2. Evaluating Model Answers

```shell
cd fastchat/llm_judge/

# Assessing model answers through an LLM-as-a-judge (here, "gpt-4-0613" is used)
python gen_judgment.py --model-list <model_id>

# Giving a penalty to the scores of non-Korean responses
cd data/mt_bench/model_judgment
python detector.py --model_id <model_id>
```

3. Showing Results

```shell
cd fastchat/llm_judge/

# Getting results
python show_result.py --mode single --input-file <file_path>
```
  • Pass ./data/mt_bench/model_judgment/<model_id>_single_final.jsonl as <file_path> to obtain the language-penalized results (the default).
  • To see the unpenalized results, pass ./data/mt_bench/model_judgment/<model_id>_single.jsonl as <file_path> instead.
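Conceptually, this step aggregates per-response judge scores from the .jsonl file into one average per model. A minimal sketch of that aggregation, assuming each record carries at least `model` and numeric `score` fields (the actual file schema may differ):

```python
import json
from collections import defaultdict

def average_scores(jsonl_lines):
    """Average judge scores per model from judgment records.
    Assumes each JSON line has a "model" name and a numeric "score"."""
    totals = defaultdict(lambda: [0.0, 0])  # model -> [sum, count]
    for line in jsonl_lines:
        rec = json.loads(line)
        totals[rec["model"]][0] += rec["score"]
        totals[rec["model"]][1] += 1
    return {model: s / n for model, (s, n) in totals.items()}

lines = [
    '{"model": "my-model", "score": 9}',
    '{"model": "my-model", "score": 8}',
]
print(average_scores(lines))  # {'my-model': 8.5}
```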

Results

Here are the evaluation results of various language models, including the EXAONE 3.0 7.8B instruction-tuned model, on KoMT-Bench. Please refer to the EXAONE 3.0 technical report for details.

| | EXAONE 3.0 7.8B Inst. | Llama 3.1 8B Inst. | Gemma 2 9B Inst. | QWEN 2 7B Inst. | Phi 3 7B Inst. | Mistral 7B Inst. |
|---|---|---|---|---|---|---|
| KoMT-Bench | 8.92 | 6.06 | 7.92 | 7.69 | 4.87 | 5.20 |

Notice

Why penalize scores for non-Korean responses?

We found that GPT-4-0613, acting as the judge, continued to award high scores even when the generated responses were written in a language other than Korean. To handle such cases, we adopt a square-root penalty, which takes the square root of the score of any non-Korean response to adjust for this discrepancy.

With the square-root penalty applied, scores for non-Korean responses fall within $[1,\sqrt{10}]$. Note that we do not apply this penalty to questions 138 and 140, as valid responses to them may be non-Korean.
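The penalty itself is a one-liner; a minimal sketch, assuming raw judge scores lie in $[1, 10]$ and that 138 and 140 are the only exempt question IDs:

```python
import math

EXEMPT_QUESTIONS = {138, 140}  # valid answers may be non-Korean

def penalized_score(score: float, is_korean: bool, question_id: int) -> float:
    """Apply the square-root penalty to non-Korean responses.
    With raw scores in [1, 10], penalized scores fall in [1, sqrt(10)]."""
    if is_korean or question_id in EXEMPT_QUESTIONS:
        return score
    return math.sqrt(score)

print(penalized_score(9.0, False, 101))  # 3.0 (penalized)
print(penalized_score(9.0, False, 138))  # 9.0 (exempt question)
print(penalized_score(9.0, True, 101))   # 9.0 (Korean, no penalty)
```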


References

[1] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 46595–46623. Curran Associates, Inc., 2023.


Citation

@misc{KoMT-Bench,
  author = {LG AI Research},
  title = {KoMT-Bench},
  year = {2024},
  publisher = {Hugging Face},
  journal = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/datasets/LGAI-EXAONE/KoMT-Bench}}
}