
ViGLUE: A Vietnamese General Language Understanding Benchmark and Analysis of Vietnamese Language Models

Authors: Minh-Nam Tran, Phu-Vinh Nguyen, Long HB Nguyen, Dien Dinh.


This repository provides the source code for the publication ViGLUE: A Vietnamese General Language Understanding Benchmark and Analysis of Vietnamese Language Models.

Introduction

ViGLUE is a Vietnamese General Language Understanding Benchmark comprising 12 tasks: 5 natural language inference tasks, 3 sentiment analysis tasks, 2 similarity and paraphrase tasks, and 2 single-sentence tasks. The tasks cover a wide range of domains, including miscellaneous text, Wikipedia, news, fiction books, movie reviews, student feedback, social media, and Quora QA. The benchmark is built in three ways: translating existing English datasets, collecting existing Vietnamese datasets, and constructing new datasets.

Dataset

The dataset is available at HuggingFace - ViGLUE, and the table below shows the statistics of each task, including the train/validation/test split sizes, the construction method, the evaluation metric, and the domain of the data.

| Corpus | Train | Validation | Test | Method | Metric | Domain |
|--------|-------|------------|------|--------|--------|--------|
| Natural Language Inference Tasks | | | | | | |
| MNLI | 392,702 | 9,815 | 9,796 | Translating | Acc. | Miscellaneous |
| QNLI | 104,743 | 5,463 | 5,463 | Translating | Acc. | Wikipedia |
| RTE | 2,490 | 277 | 3,000 | Translating | Acc. | Miscellaneous |
| VNRTE | 12,526 | 3,137 | - | Constructing | Acc. | News |
| WNLI | 635 | 71 | 146 | Translating | Acc. | Fiction books |
| Sentiment Analysis Tasks | | | | | | |
| SST2 | 67,349 | 872 | 1,821 | Translating | Acc. | Movie reviews |
| VSFC | 11,426 | 1,538 | 3,166 | Collecting | Acc. | Student feedback |
| VSMEC | 5,548 | 686 | 693 | Collecting | Acc. | Social media |
| Similarity and Paraphrase Tasks | | | | | | |
| MRPC | 3,668 | 408 | 1,725 | Translating | Acc./F1 | News |
| QQP | 363,846 | 40,430 | 390,965 | Translating | Acc./F1 | Quora QA |
| Single-Sentence Tasks | | | | | | |
| CoLA | 8,551 | 1,043 | 1,063 | Translating | MCC | Miscellaneous |
| VToC | 7,293 | 1,831 | - | Constructing | Acc. | News |
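
Each task can be loaded directly with the datasets library. Below is a minimal sketch, assuming each task is exposed as a dataset configuration named after its lowercase task name (the same names passed to --task_name later in this README):

# Load one ViGLUE task from the HuggingFace Hub.
from datasets import load_dataset

# "vsfc" is illustrative; any of the 12 lowercase task names should work.
vsfc = load_dataset("tmnam20/ViGLUE", "vsfc")
print(vsfc)              # a DatasetDict with the splits listed in the table
print(vsfc["train"][0])  # inspect one example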

Installation

To run the project, please use Python 3.10 and install the required packages by running the following command:

pip install -r requirements.txt

Usage

This section provides the instructions to train and evaluate the models on the ViGLUE dataset.

Evaluating Encoder-only Transformer Models

The script run_glue.py fine-tunes and evaluates encoder-only transformer models such as PhoBERT, mBERT, XLM-R, and BERT. To train and evaluate a model, run the following command:

python run_glue.py \
    --model_name_or_path FacebookAI/xlm-roberta-base \
    --dataset_name tmnam20/ViGLUE \
    --task_name vsfc \
    --do_train \
    --do_eval \
    --max_seq_length 256 \
    --per_device_train_batch_size 32 \
    --per_device_eval_batch_size 32 \
    --learning_rate 2e-5 \
    --num_train_epochs 3 \
    --output_dir ./output \
    --evaluation_strategy steps \
    --eval_steps 500 \
    --logging_steps 50 \
    --save_steps 500 \
    --save_total_limit 1 \
    --cache_dir ./cache

where --model_name_or_path is the name or path of the model, --task_name is one of the 12 tasks in ViGLUE, and --output_dir is the directory in which to save the model. The batch size, learning rate, and number of epochs can be adjusted to fit the available GPU memory and training-time budget through the arguments --per_device_train_batch_size, --learning_rate, and --num_train_epochs, respectively.
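
After training, the checkpoint saved in --output_dir can be loaded back for inference. A minimal sketch (the example sentence and the 256-token limit mirror the command above; the predicted label id maps to the task's label set):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the fine-tuned checkpoint produced by run_glue.py.
tokenizer = AutoTokenizer.from_pretrained("./output")
model = AutoModelForSequenceClassification.from_pretrained("./output")
model.eval()

# Classify one sentence (VSFC is a single-sentence sentiment task).
inputs = tokenizer("Thầy giảng bài rất dễ hiểu.", return_tensors="pt",
                   truncation=True, max_length=256)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1).item())  # predicted label id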

Evaluating Generative Language Models

To evaluate generative language models, such as BLOOM, PhoGPT, and Llama, run the following command:

python eval_gen.py \
    --model_name_or_path vinai/PhoGPT-7B5 \
    --dataset_name tmnam20/ViGLUE \
    --prompt_type en \
    --task_name qqp \
    --batch_size 1 \
    --cache_dir ./cache \
    --output_filename phogpt-7b5_en-4.json \
    --evaluate_validation 1 \
    --n_shots 4 \
    --output_dir eval_results/en/phogpt

where --model_name_or_path is the name of the large language model, --task_name is one of the 12 tasks in ViGLUE or a comma-separated list of tasks, --prompt_type selects the prompt language (only en, English prompts, is supported for now), --output_dir is the directory in which to save the evaluation results, --n_shots is the number of in-context examples for few-shot evaluation, and --output_filename is the name of the output file for the evaluation results.
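
For intuition, few-shot evaluation of this kind prepends --n_shots labeled training examples to each query before generation. The following is a minimal sketch of how such a prompt could be assembled for QQP; the field names question1, question2, and label follow the usual QQP schema and are assumptions here, not the documented interface of eval_gen.py:

from datasets import load_dataset

def build_prompt(shots, example):
    """Prepend labeled demonstrations to the query, one block per shot."""
    blocks = []
    for s in shots:
        blocks.append(f"Question 1: {s['question1']}\n"
                      f"Question 2: {s['question2']}\n"
                      f"Duplicate: {'yes' if s['label'] == 1 else 'no'}")
    blocks.append(f"Question 1: {example['question1']}\n"
                  f"Question 2: {example['question2']}\n"
                  f"Duplicate:")
    return "\n\n".join(blocks)

qqp = load_dataset("tmnam20/ViGLUE", "qqp")
shots = [qqp["train"][i] for i in range(4)]  # 4-shot demonstrations
print(build_prompt(shots, qqp["validation"][0]))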

Note that to evaluate the BLOOM 176B model, we use the HuggingFace Inference API endpoint, so please make sure that you have an account and set the API token by running the following commands:

export HF_TOKEN=<YOUR_HF_TOKEN>

python eval_api.py \
    --model_name_or_path bigscience/bloom \
    --dataset_name tmnam20/ViGLUE \
    --prompt_type en \
    --task_name vnrte,qnli,mnli,qqp \
    --batch_size 1 \
    --cache_dir ./cache \
    --output_filename bloom-176b_en.json \
    --evaluate_validation 1 \
    --n_shots 0,1,2,4 \
    --output_dir eval_results/en/bloom
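
For reference, the underlying API call pattern looks roughly like the following sketch with huggingface_hub's InferenceClient. It illustrates how a hosted model is queried, not the exact implementation of eval_api.py, and it assumes bigscience/bloom is still served by the endpoint:

import os
from huggingface_hub import InferenceClient

# The client reads the token exported above.
client = InferenceClient(model="bigscience/bloom", token=os.environ["HF_TOKEN"])

# Send one prompt and read back a short completion to parse into a label.
completion = client.text_generation("Question: ...\nAnswer:", max_new_tokens=8)
print(completion)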

Results

The table below shows the results of multilingual language models on the ViGLUE dataset:

| Model | MNLI | QNLI | RTE | VNRTE | WNLI | SST2 | VSFC | VSMEC | MRPC | QQP | CoLA | VToC |
|-------|------|------|-----|-------|------|------|------|-------|------|-----|------|------|
| Metric | Acc. | Acc. | Acc. | Acc. | Acc. | Acc. | Acc. | Acc. | Acc./F1 | Acc./F1 | MCC | Acc. |
| ZeroR | 35.45 | 50.54 | 52.71 | 53.11 | 56.34 | 50.92 | 50.85 | 31.20 | 68.38/81.22 | 63.38/0.00 | 0.00 | 6.77 |
| mBERT | 79.66 | 89.11 | 70.76 | 99.97 | 56.34 | 88.42 | 93.62 | 53.64 | 85.29/88.85 | 89.12/85.16 | 14.13 | 81.43 |
| XLM-R_base | 81.61 | 88.17 | 62.45 | 100.00 | 56.34 | 89.45 | 94.95 | 55.25 | 83.82/88.26 | 89.46/85.87 | 3.64 | 83.07 |
| XLM-R_large | 35.45 | 91.23 | 67.51 | 100.00 | 54.93 | 90.14 | 95.39 | 37.90 | 88.24/91.64 | 90.48/87.22 | 0.00 | 87.82 |
| mDeBERTaV3 | 83.34 | 89.99 | 69.31 | 99.97 | 56.34 | 89.79 | 95.07 | 55.39 | 86.52/90.05 | 89.98/86.69 | 19.62 | 80.88 |

For the results of Vietnamese pretrained embedding models, please refer to the paper.

The results of the generative language models, comparing BLOOM 7B1 and PhoGPT-7B5, are shown in the figure below:

[Figure: generative language model results]

Contact

If you have any questions, please contact me at tmnam20@apcs.fitus.edu.vn or trminhnam20082002@gmail.com.

Acknowledgements

I would like to thank my advisors, Assoc. Prof. Dinh Dien and Dr. Nguyen Hong Buu Long, for their guidance and support, and my colleague, Phu-Vinh Nguyen, for his support in the project.