VMLU is a human-centric benchmark suite specifically designed to evaluate the general capabilities of foundation models with a focus on the Vietnamese language. The benchmark covers 58 subjects spanning four categories: STEM, Social Sciences, Humanities, and Others. It ranges in difficulty from an elementary level to an advanced professional level, and tests both general knowledge and problem-solving ability. Please visit our website for more details.
We hope VMLU can help developers track progress and analyze the important strengths and shortcomings of their models.
To download the zip file, please visit our website.
To facilitate usage, we have organized the subject name handlers and the English/Vietnamese names corresponding to all 58 subjects. Questions extracted from the datasets are presented in both LaTeX and non-LaTeX formats.
The VMLU dataset covers 58 subjects and comprises 10,880 multiple-choice questions and answers in Vietnamese.
Id | Subject | Category | Number of questions |
---|---|---|---|
01 | Elementary Mathematics | STEM | 200 |
02 | Elementary Science | STEM | 200 |
03 | Middle School Biology | STEM | 188 |
04 | Middle School Chemistry | STEM | 200 |
05 | Middle School Mathematics | STEM | 119 |
06 | Middle School Physics | STEM | 200 |
07 | High School Biology | STEM | 200 |
08 | High School Chemistry | STEM | 200 |
09 | High School Mathematics | STEM | 163 |
10 | High School Physics | STEM | 200 |
11 | Applied Informatics | STEM | 200 |
12 | Computer Architecture | STEM | 200 |
13 | Computer Network | STEM | 197 |
14 | Discrete Mathematics | STEM | 182 |
15 | Electrical Engineering | STEM | 194 |
16 | Introduction to Chemistry | STEM | 197 |
17 | Introduction to Physics | STEM | 191 |
18 | Introduction to Programming | STEM | 197 |
19 | Metrology Engineer | STEM | 155 |
20 | Operating System | STEM | 200 |
21 | Statistics and Probability | STEM | 192 |
22 | Middle School Civil Education | Social Science | 196 |
23 | Middle School Geography | Social Science | 162 |
24 | High School Civil Education | Social Science | 200 |
25 | High School Geography | Social Science | 179 |
26 | Business Administration | Social Science | 192 |
27 | Ho Chi Minh Ideology | Social Science | 197 |
28 | Macroeconomics | Social Science | 200 |
29 | Microeconomics | Social Science | 200 |
30 | Principles of Marxism and Leninism | Social Science | 200 |
31 | Sociology | Social Science | 196 |
32 | Elementary History | Humanity | 195 |
33 | Middle School History | Humanity | 200 |
34 | Middle School Literature | Humanity | 192 |
35 | High School History | Humanity | 200 |
36 | High School Literature | Humanity | 200 |
37 | Administrative Law | Humanity | 100 |
38 | Business Law | Humanity | 197 |
39 | Civil Law | Humanity | 200 |
40 | Criminal Law | Humanity | 180 |
41 | Economic Law | Humanity | 178 |
42 | Education Law | Humanity | 183 |
43 | History of World Civilization | Humanity | 200 |
44 | Ideological and Moral Cultivation | Humanity | 200 |
45 | Introduction to Laws | Humanity | 139 |
46 | Introduction to Vietnam Culture | Humanity | 200 |
47 | Logic | Humanity | 192 |
48 | Revolutionary Policy of the Vietnamese Communist Party | Humanity | 200 |
49 | Vietnamese Language and Literature | Humanity | 192 |
50 | Accountant | Other | 186 |
51 | Clinical Pharmacology | Other | 200 |
52 | Environmental Engineering | Other | 189 |
53 | Internal Basic Medicine | Other | 189 |
54 | Preschool Pedagogy | Other | 112 |
55 | Tax Accountant | Other | 192 |
56 | Tax Civil Servant | Other | 189 |
57 | Civil Servant | Other | 189 |
58 | Driving License Certificate | Other | 189 |
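Question ids appear to be prefixed with the subject number from this table; for instance, the first dev.jsonl example below has id 51-0001 and is a Clinical Pharmacology question (subject 51). Assuming that convention holds, a minimal sketch of mapping an id back to its subject and category could look like this; the SUBJECTS dictionary lists only a few rows of the table above, and the helper name is ours, not part of the released code.

```python
# Assumed convention: the id prefix ("51" in "51-0001") is the subject number above.
SUBJECTS = {
    "01": ("Elementary Mathematics", "STEM"),
    "22": ("Middle School Civil Education", "Social Science"),
    "51": ("Clinical Pharmacology", "Other"),
    "58": ("Driving License Certificate", "Other"),
}

def subject_of(question_id: str) -> tuple[str, str]:
    """Return (subject name, category) for a question id such as '51-0001'."""
    return SUBJECTS[question_id.split("-")[0]]

print(subject_of("51-0001"))  # ('Clinical Pharmacology', 'Other')
```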
Below is a non-LaTeX example from dev.jsonl:
{
"id": "51-0001",
"question": "Các phát biểu ĐÚNG về ĐỊNH NGHĨA Dược lâm sàng, NGOẠI TRỪ:",
"choices": [
"A. Là ngành khoa học về sử dụng thuốc hợp lý",
"B. Nghiên cứu phát triển kinh tế dược bệnh viện",
"C. Giúp tối ưu hóa việc sử dụng thuốc trên cơ sở về dược và y sinh học",
"D. Đối tượng chính của môn học dược lâm sàng là thuốc và người bệnh"
],
"answer": "B"
}
Below is a LaTeX example from dev.jsonl:
{
"id": "58-0006",
"question": "Trong không gian Oxyz, cho đường thẳng d:{\\frac{x-2}{1}}=\\frac{y-1}{-2}=\\frac{z+1}{3}. Điểm nào dưới đây thuộc d ?",
"choices": [
"A. Q\\left(2;{1};{1}\\right)",
"B. M\\left(1;{2};{3}\\right)",
"C. N\\left(1;{-}2;{3}\\right)",
"D. P\\left(2;{1};{-}1\\right)"
],
"answer": "D"
}
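To make the record layout concrete, here is a minimal sketch of loading dev.jsonl and turning one record into a zero-shot multiple-choice prompt. It relies only on the fields shown in the examples above (id, question, choices, answer); the prompt wording is illustrative and not the template used for the reported results.

```python
import json

# Load dev.jsonl: one JSON object per line with the fields shown above.
with open("dev.jsonl", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

def build_prompt(record: dict) -> str:
    """Format one record as a zero-shot multiple-choice prompt (illustrative template)."""
    lines = [record["question"]]
    lines.extend(record["choices"])       # choices already carry the "A." ... "D." labels
    lines.append("Answer:")
    return "\n".join(lines)

example = records[0]
print(build_prompt(example))
print("Gold answer:", example["answer"])  # e.g. "B"
```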
Below are the zero-shot and five-shot accuracies of the models evaluated in the initial release. For the latest results, please visit our official website.
DISCLAIMER: Please note that evaluating models like LLMs can be challenging, as leaderboards may be susceptible to manipulation and a small tweak in prompting can lead to totally different results. This is especially concerning because some models are not publicly accessible; for instance, good results can be achieved by distilling answers from stronger models such as GPT-4, or even from humans. Therefore, it is important to approach leaderboard scores with caution. Most of the models assessed here are open access, with public weights or APIs available for verification.
# | Model | Organization | Base Model | Accessibility | Evaluation Date | STEM | Social Science | Humanities | Other | Average |
---|---|---|---|---|---|---|---|---|---|---|
1 | Llama-3-70B | Meta | From scratch | Weight | 23/04/2024 | 61.7 | 74.91 | 68.74 | 63.53 | 66.44 |
2 | GPT-4 | OpenAI | From scratch | API | 08/01/2024 | 63.84 | 71.78 | 66.14 | 60.37 | 65.53 |
3 | gemini | Google | From scratch | API | 30/01/2024 | 42.80 | 60.31 | 55.35 | 51.30 | 51.03 |
4 | ChatGPT | OpenAI | From scratch | API | 08/01/2024 | 43.31 | 51.52 | 46.88 | 46.90 | 46.39 |
5 | ViGPT-1.6B-v1 | Vin BigData | From scratch | Private | 08/01/2024 | 35.06 | 48.72 | 47.20 | 42.54 | 42.34 |
6 | gemma-7b-it | Google | From scratch | Weight | 22/02/2024 | 39.95 | 44.93 | 43.39 | 40.11 | 41.90 |
7 | Qwen-7B | Alibaba Cloud | From scratch | Weight | 08/01/2024 | 30.64 | 35.07 | 34.15 | 32.68 | 32.81 |
8 | gemma-2b-it | Google | From scratch | Weight | 22/02/2024 | 24.39 | 29.59 | 31.01 | 26.81 | 27.72 |
9 | sealion7b | AI Singapore | From scratch | Weight | 08/01/2024 | 26.28 | 28.57 | 27.66 | 27.34 | 26.73 |
10 | bloom-1b7 | BigScience | From scratch | Weight | 08/01/2024 | 25.13 | 25.09 | 26.34 | 25.19 | 25.51 |
11 | bloom-7b1 | BigScience | From scratch | Weight | 08/01/2024 | 25.08 | 26.26 | 25.74 | 24.59 | 25.41 |
12 | falcon-7b | Technology Innovation Institute | From scratch | Weight | 08/01/2024 | 24.19 | 23.59 | 26.72 | 24.73 | 24.96 |
13 | PhoGPT-7B5-Instruct | Vin AI | From scratch | Weight | 08/01/2024 | 21.97 | 25.93 | 24.32 | 26.00 | 24.01 |
14 | Llama-2-7b-hf | Facebook Research - Meta | From scratch | Weight | 08/01/2024 | 21.48 | 23.41 | 24.10 | 23.59 | 22.95 |
15 | falcon-7b-instruct | Technology Innovation Institute | From scratch | Weight | 08/01/2024 | 9.50 | 13.63 | 14.98 | 6.13 | 11.39 |
# | Model | Organization | Base Model | Accessibility | Evaluation Date | STEM | Social Science | Humanities | Other | Average |
---|---|---|---|---|---|---|---|---|---|---|
1 | VNPTAI.IO-14B | VNPT AI | Qwen1.5-14B-Chat | Private | 14/03/2024 | 51.64 | 61.75 | 58.09 | 54.51 | 55.83 |
2 | SeaLLM-7B-v2.5 | DAMO Academy | llama-2-7b | Private | 09/04/2024 | 49.35 | 60.66 | 55.95 | 49.05 | 53.30 |
3 | Vistral-7B-Chat | UONLP x Ontocord | Mistral-7B-v0.1 | Weight | 15/01/2024 | 43.43 | 57.02 | 55.12 | 48.01 | 50.07 |
4 | SDSRV-7B-chat | SDSRV Teams | Mistral-7B-v0.1 | Private | 04/05/2024 | 36.29 | 60.55 | 55.95 | 49.05 | 48.55 |
5 | Arcanic Cono 1.5 | Arcanic AI | Mistral-7B-v0.1 | API | 04/05/2024 | 42.11 | 52.44 | 51.97 | 45.36 | 47.45 |
6 | SeaLLM-7B-v2 | DAMO Academy | llama-2-7b | Weight | 15/02/2024 | 39.95 | 52.03 | 49.38 | 45.27 | 45.79 |
7 | bloomz-7b1 | BigScience | bloom-7b1 | Weight | 08/01/2024 | 32.63 | 45.73 | 41.85 | 39.89 | 38.87 |
8 | T-Llama | FPTU HCM | llama-2-7b | Weight | 20/03/2024 | 32.20 | 43.15 | 40.31 | 36.57 | 37.28 |
9 | vbd-llama2-7b-50b-chat | Vin BigData | llama-2-7b | Weight | 08/01/2024 | 31.45 | 40.34 | 40.24 | 39.62 | 36.98 |
10 | vietcuna-3b | Virtual Interactive | bloomz-3b | Weight | 08/01/2024 | 30.12 | 39.92 | 37.86 | 33.83 | 34.79 |
11 | bloomz-1b7 | BigScience | bloom-1b7 | Weight | 08/01/2024 | 29.72 | 40.17 | 34.73 | 33.41 | 33.65 |
12 | SeaLLM-7B-Hybrid | DAMO Academy | llama-2-7b | Weight | 08/01/2024 | 29.49 | 34.61 | 36.68 | 34.52 | 33.39 |
13 | ura-llama-7b | Ho Chi Minh City University of Technology | llama-2-7b | Weight | 08/01/2024 | 29.19 | 33.31 | 34.64 | 32.97 | 32.18 |
14 | vinallama-7b-chat | Virtual Interactive | llama-2-7b | Weight | 08/01/2024 | 25.70 | 34.50 | 33.87 | 31.41 | 30.64 |
15 | vietcuna-7b-v3 | Virtual Interactive | bloomz-7b | Weight | 08/01/2024 | 28.70 | 33.94 | 31.32 | 28.24 | 30.34 |
16 | vietnamese-llama-7b-40GB | BKAI - HUST | llama-2-7b | Weight | 08/01/2024 | 23.22 | 25.61 | 26.71 | 26.30 | 25.19 |
Model | STEM | Social Science | Humanities | Other | Average |
---|---|---|---|---|---|
GPT-4 | TBU | TBU | TBU | TBU | TBU |
You can simply follow our code:
cd code_benchmark;
GPTKEY='<YOUR GPT KEY>' python test_gpt.py
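For reference, the sketch below shows one way such an evaluation loop could be structured: build a prompt per question, query a model, keep the first answer letter, and compute per-category accuracy. This is not the implementation inside test_gpt.py; ask_model is a placeholder you would replace with your own model or API call, and the id-prefix-to-category mapping is supplied by the caller.

```python
from collections import defaultdict

def ask_model(prompt: str) -> str:
    """Placeholder: send the prompt to your model or API and return its raw text reply."""
    raise NotImplementedError

def evaluate(records, subject_category):
    """Per-category accuracy; subject_category maps id prefixes ('01'..'58') to category names."""
    correct, total = defaultdict(int), defaultdict(int)
    for rec in records:
        category = subject_category[rec["id"].split("-")[0]]
        prompt = "\n".join([rec["question"], *rec["choices"], "Answer:"])
        predicted = ask_model(prompt).strip()[:1].upper()   # keep the first letter, e.g. "B"
        total[category] += 1
        correct[category] += predicted == rec["answer"]
    return {cat: correct[cat] / total[cat] for cat in total}
```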
You first need to prepare a UTF-8 encoded CSV file in the following format; please refer to submission_example.csv for details.
## key within each subject is the "id" field from the dataset
id,answer
41-0001,A
41-0002,A
41-0003,A
41-0004,C
41-0005,A
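As an illustration, the snippet below writes predictions to a file in exactly this layout, UTF-8 encoded, with the same id,answer header; the predictions dictionary and the submission.csv filename are placeholders for your own outputs.

```python
import csv

# Hypothetical predictions: question id -> predicted answer letter.
predictions = {"41-0001": "A", "41-0002": "A", "41-0003": "A"}

# Write a UTF-8 CSV with the required "id,answer" header.
with open("submission.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "answer"])
    for qid, answer in sorted(predictions.items()):
        writer.writerow([qid, answer])
```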
Then you can submit the prepared CSV file here; note that you need to log in first to access the submission page.
TBU
TBU