| 🤗 HuggingFace | EXAONE 3.0 7.8B Tech Report |
This is the official repository for KoMT-Bench, built by LG AI Research to evaluate the Korean instruction-following capability of language models, as described in the technical report "EXAONE 3.0 7.8B Instruction Tuned Language Model". KoMT-Bench was developed by translating the MT-Bench [1] dataset into Korean and modifying some questions to reflect the characteristics and cultural nuances of the Korean language.

All source code in this repository is based on LMSYS's FastChat repository, which we have adapted to support the EXAONE 3.0 7.8B model.
Here are examples from KoMT-Bench:
| Category | MT-Bench | KoMT-Bench |
|---|---|---|
| **Writing** | | |
| 1st Turn | Imagine you are writing a blog post comparing two popular smartphone models. Develop an outline for the blog post, including key points and subheadings to effectively compare and contrast the features, performance, and user experience of the two models. Please answer in fewer than 200 words. | 두 개의 인기 스마트폰 모델을 비교하는 블로그 게시물을 작성한다고 가정합니다. 두 모델의 기능, 성능, 사용자 경험을 효과적으로 비교하고 대조할 수 있도록 핵심 사항과 소제목을 포함하여 블로그 게시물의 개요를 작성하세요. 200자 이내로 답하십시오. |
| 2nd Turn | Take your previous response and rephrase it as a limerick. | 이전 답변을 충청도 사투리로 재작성하십시오. |
| **Math** | | |
| 1st Turn | When a number is divided by 10, the remainder is 4. What is the remainder when twice the number is divided by 4? | 어떤 숫자를 10으로 나눈 나머지는 4입니다. 그 숫자의 두 배를 4로 나눈 나머지를 구하세요. |
| 2nd Turn | What about when twice the number is divided by 5? | 그 숫자의 두 배를 5로 나누면 어떨까요? |
| **Humanities** | | |
| 1st Turn | Provide insights into the correlation between economic indicators such as GDP, inflation, and unemployment rates. Explain how fiscal and monetary policies affect those indicators. | GDP, 인플레이션, 실업률과 같은 경제 지표 간의 상관관계에 대한 통찰을 제시하세요. 이러한 지표들에 재정 및 통화 정책이 어떤 영향을 미치는지 설명하세요. |
| 2nd Turn | Now, explain them again like I'm five. | 이제 제가 5살이라 생각하고 다시 설명해 주세요. |
```bash
git clone https://github.com/LG-AI-EXAONE/KoMT-Bench
cd FastChat
bash setting.sh
export OPENAI_API_KEY=<openai key>
export PYTHONPATH=${PYTHONPATH}:${PWD}
```
- Note that we apply a square-root penalty to non-Korean responses, which requires detecting the language in which each response is written. To accomplish this, we utilize a language detector developed by Google. We recommend that users check that the detector has been installed correctly at the following path: `./fastchat/llm_judge/data/mt_bench/model_judgment/language_detector.tflite`
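As a quick sanity check, the expected location can be verified from the FastChat root. The helper below is a hypothetical illustration, not code shipped with the repository:

```python
import os

# Hypothetical sanity check (not part of the repository): confirm that
# setting.sh placed the Google language-detector model where the judging
# scripts expect to find it.
DETECTOR_RELPATH = os.path.join(
    "fastchat", "llm_judge", "data", "mt_bench",
    "model_judgment", "language_detector.tflite",
)

def detector_installed(fastchat_root: str) -> bool:
    """Return True if the detector model file exists under the given root."""
    return os.path.isfile(os.path.join(fastchat_root, DETECTOR_RELPATH))
```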
```bash
bash run_ko_mt.sh
```
- To evaluate the EXAONE model end to end, simply execute the single command above.
To run each step separately, please follow the steps below:
```bash
cd fastchat/llm_judge/

# Generating model answers
CUDA_VISIBLE_DEVICES=0 python gen_model_answer.py \
    --model-path <model_name or model_path> \
    --model-id <model_id> \
    --dtype bfloat16
```
- To evaluate the EXAONE model, `<model_name or model_path>` must include the keyword `exaone`.
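For illustration only, the keyword requirement above amounts to a simple substring check on the supplied path. The helper below is hypothetical (not code from the repository), and the case-insensitive matching is an assumption:

```python
def is_exaone_path(model_path: str) -> bool:
    # Hypothetical illustration of the requirement above: the path must
    # contain the keyword "exaone" (matched case-insensitively here).
    return "exaone" in model_path.lower()

print(is_exaone_path("LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct"))  # True
print(is_exaone_path("meta-llama/Llama-3.1-8B-Instruct"))      # False
```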
```bash
cd fastchat/llm_judge/

# Assessing model answers through an LLM-as-a-judge (here, "gpt-4-0613" is used)
python gen_judgment.py --model-list <model_id>

# Applying a penalty to the scores of non-Korean responses
cd data/mt_bench/model_judgment
python detector.py --model_id <model_id>
```
```bash
cd fastchat/llm_judge/

# Getting results
python show_result.py --mode single --input-file <file_path>
```
- Pass `./data/mt_bench/model_judgment/<model_id>_single_final.jsonl` as `<file_path>` to obtain the language-penalized results (default).
- If you want to see the unpenalized results, pass `./data/mt_bench/model_judgment/<model_id>_single.jsonl` instead.
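As a sketch of how such judgment files could be aggregated by hand, assuming each `.jsonl` line is a JSON object with a numeric `score` field (the field names here are assumptions, not a documented schema):

```python
import json

def average_score(jsonl_lines):
    # Hypothetical aggregation sketch: each line is assumed to be a JSON
    # object whose "score" field holds the judge's rating for one turn.
    scores = [json.loads(line)["score"] for line in jsonl_lines if line.strip()]
    return sum(scores) / len(scores)

sample = ['{"question_id": 81, "score": 9}', '{"question_id": 82, "score": 7}']
print(average_score(sample))  # 8.0
```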
Here are the evaluation results of various language models, including the EXAONE 3.0 7.8B instruction-tuned model, on KoMT-Bench. Please refer to the EXAONE 3.0 technical report for details.
| | EXAONE 3.0 7.8B Inst. | Llama 3.1 8B Inst. | Gemma 2 9B Inst. | Qwen 2 7B Inst. | Phi 3 7B Inst. | Mistral 7B Inst. |
|---|---|---|---|---|---|---|
| KoMT-Bench | 8.92 | 6.06 | 7.92 | 7.69 | 4.87 | 5.20 |
We found that, even when generated responses were in a language other than Korean, GPT-4-0613, acting as the judge, continued to award high scores. To handle such cases, we adopt a square-root penalty, which takes the square root of the score of any non-Korean response to adjust for this discrepancy.

Since raw judge scores lie in [1, 10], applying the square-root penalty maps the score of a non-Korean response into [1, √10] ≈ [1, 3.16].
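A minimal sketch of the penalty described above, assuming judge scores in [1, 10] and an ISO-style language code from the detector (function and argument names are illustrative):

```python
import math

def penalized(score: float, lang: str) -> float:
    # Square-root penalty sketch: Korean responses keep their raw score,
    # while non-Korean responses are mapped from [1, 10] into [1, sqrt(10)].
    return score if lang == "ko" else math.sqrt(score)

print(penalized(9.0, "ko"))  # 9.0
print(penalized(9.0, "en"))  # 3.0
```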
[1] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 46595–46623. Curran Associates, Inc., 2023.
```bibtex
@misc{KoMT-Bench,
  author       = {LG AI Research},
  title        = {KoMT-Bench},
  year         = {2024},
  publisher    = {Hugging Face},
  journal      = {Hugging Face repository},
  howpublished = {\url{https://huggingface.co/datasets/LGAI-EXAONE/KoMT-Bench}}
}
```