/vietpunc

Vietnamese Punctuation Prediction using Pretrained Language Models

Primary LanguageJupyter Notebook


VNPunc: Vietnamese Punctuation Prediction using Pretrained Language Models

Fine-tune a variety of pre-trained Transformer-based models to solve Vietnamese Punctuation Prediction task.

About The Project

In this project, we utilize the effectiveness of the different pre-trained language models such as viELECTRA, viBERT, XLM-RoBERTa to restore seven common punctuation marks in Vietnamese.

We also stack a LSTM layer and CRF layer on the top of output representations. This contributions achieve a significant improvement over the previous models.

Prerequisites

To reproduce the experiments of our model, please install the requirements.txt according to the following instructions:

  • transformers==4.16.2
  • pytorch==1.10.0
  • python==3.7
pip install -r requirements.txt

Data

We also include Vietnamese novel and news dataset in this project. Thanks to this work for providing these datasets.

Instructions

Training

python3 run_train_punc.py --model_name_or_path=bert-base-multilingual-cased \
                            --model_arch lstm_crf \
                            --model_type bert \
                            --data_dir=data/News \ 
                            --output_dir=outputs \ 
                            --task_name=punctuation_prediction \
                            --max_seq_length=190 \
                            --do_train \
                            --do_eval  \ 
                            --eval_on=test \
                            --train_batch_size=32

Contact

Hieu Tran - heraclex12@gmail.com

Code for paper An Efficient Transformer-Based Model for Vietnamese Punctuation Prediction

Citation

@InProceedings{10.1007/978-3-030-79463-7_5,
      author="Tran, Hieu
      and Dinh, Cuong V.
      and Pham, Quang
      and Nguyen, Binh T.",
      editor="Fujita, Hamido
      and Selamat, Ali
      and Lin, Jerry Chun-Wei
      and Ali, Moonis",
      title="An Efficient Transformer-Based Model for Vietnamese Punctuation Prediction",
      booktitle="Advances and Trends in Artificial Intelligence. From Theory to Practice",
      year="2021",
      publisher="Springer International Publishing",
      address="Cham",
      pages="47--58",
      isbn="978-3-030-79463-7"
}

Acknowledgements