ViDeBERTa: A powerful pre-trained language model for Vietnamese, EACL 2023

Paper: https://aclanthology.org/2023.findings-eacl.79.pdf

ViDeBERTa

Contributors

  • Tran Cong Dao
  • Pham Nhut Huy
  • Nguyen Tuan Anh
  • Hy Truong Son (Corresponding author / PI)

Main components

  1. Pre-training
  2. Model
  3. Fine-tuning

Pre-training

Code architecture

  1. bash: bash scripts to run the pipeline
  2. config: model configuration files (JSON)
  3. dataset: datasets folder (stores both the original txt corpora and the on-disk data reloaded via datasets.load_from_disk)
  4. source: main Python scripts for pre-tokenization, tokenizer training, and model pre-training
  5. tokenizer: folder to store tokenizers

Pre-tokenizer

  • Split the original txt datasets into train, validation, and test sets with a 90%/5%/5% split.
  • Use the PyVi library to word-segment the datasets.
  • Save the datasets to disk (see the sketch after this list).
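
A minimal sketch of these steps, assuming the corpus is a single plain-text file and using Hugging Face datasets together with PyVi; the file names, folder layout, and column name are illustrative rather than the repo's actual ones:

```python
from datasets import DatasetDict, load_dataset
from pyvi import ViTokenizer

# Load the raw corpus (one text per line); the file name is illustrative.
raw = load_dataset("text", data_files="corpus.txt")["train"]

# 90% train, then split the remaining 10% evenly into validation and test (5%/5%).
first = raw.train_test_split(test_size=0.10, seed=42)
heldout = first["test"].train_test_split(test_size=0.50, seed=42)

def segment(batch):
    # PyVi joins the syllables of each Vietnamese word with underscores.
    return {"text": [ViTokenizer.tokenize(t) for t in batch["text"]]}

splits = DatasetDict(
    train=first["train"], validation=heldout["train"], test=heldout["test"]
).map(segment, batched=True)

# Persist so later steps can reload with datasets.load_from_disk.
splits.save_to_disk("dataset/pretokenized")
```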

Pre-train_tokenizer

  • Load datasets
  • Train the tokenizers with SentencePiece models
  • Save the tokenizers (see the sketch after this list).
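
A minimal sketch of tokenizer training with the SentencePiece library, reusing the splits saved above; the vocabulary size, character coverage, and paths are assumptions, not values taken from the repo or paper:

```python
import os

import sentencepiece as spm
from datasets import load_from_disk

# Reload the segmented training split saved by the pre-tokenizer step
# and dump it back to plain text, which SentencePiece consumes.
train = load_from_disk("dataset/pretokenized")["train"]
os.makedirs("tokenizer", exist_ok=True)
with open("tokenizer/train_corpus.txt", "w", encoding="utf-8") as f:
    for example in train:
        f.write(example["text"] + "\n")

# Train a SentencePiece model; DeBERTa-v3-style tokenizers are SentencePiece-based.
spm.SentencePieceTrainer.train(
    input="tokenizer/train_corpus.txt",
    model_prefix="tokenizer/videberta_spm",
    vocab_size=128000,          # illustrative value
    model_type="unigram",
    character_coverage=0.9995,  # keep rare Vietnamese characters
)
```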

Pre-train_model

  • Load datasets
  • Load tokenizers
  • Pre-train DeBERTa-v3 (see the sketch after this list).
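
DeBERTa-v3 is pre-trained with an ELECTRA-style replaced-token-detection objective (with gradient-disentangled embedding sharing), which in practice needs the official DeBERTa training code; the sketch below is only a simplified stand-in that wires the saved splits and the SentencePiece tokenizer into a Hugging Face Trainer with a masked-LM collator. All paths, model sizes, and hyperparameters are assumptions.

```python
from datasets import load_from_disk
from transformers import (
    DataCollatorForLanguageModeling,
    DebertaV2Config,
    DebertaV2ForMaskedLM,
    DebertaV2Tokenizer,
    Trainer,
    TrainingArguments,
)

# The slow DeBERTa-v2 tokenizer can load a SentencePiece model file directly.
tokenizer = DebertaV2Tokenizer(vocab_file="tokenizer/videberta_spm.model")

splits = load_from_disk("dataset/pretokenized")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = splits.map(tokenize, batched=True, remove_columns=["text"])

# Randomly initialized DeBERTa-v2-style encoder; the configuration defaults
# stand in for whatever model size is actually trained.
config = DebertaV2Config(vocab_size=len(tokenizer))
model = DebertaV2ForMaskedLM(config)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="checkpoints",
        per_device_train_batch_size=32,   # illustrative hyperparameters
        learning_rate=6e-4,
        max_steps=500_000,
    ),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
```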

Model

Fine-tuning

Code architecture

  1. POS tagging and NER (POS_NER)
  2. Question Answering (QA and QA2)
  3. Open-domain Question Answering (OPQA); see the sketch after this list for how each task maps to a model head.
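
A minimal sketch of how these task folders map onto standard Hugging Face task heads, assuming the pre-trained checkpoint is available on the Hugging Face Hub; the identifier Fsoft-AIC/videberta-base and the label count are assumptions to verify against the repo:

```python
from transformers import (
    AutoModelForQuestionAnswering,
    AutoModelForTokenClassification,
    AutoTokenizer,
)

checkpoint = "Fsoft-AIC/videberta-base"  # assumed Hub identifier; check the repo
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# POS tagging and NER (POS_NER): token classification, one label per token.
pos_ner_model = AutoModelForTokenClassification.from_pretrained(
    checkpoint, num_labels=9  # illustrative label count
)

# QA, QA2, and the reader side of OPQA: extractive span prediction (start/end logits).
qa_model = AutoModelForQuestionAnswering.from_pretrained(checkpoint)
```

Open-domain QA typically pairs such an extractive reader with a passage retriever; the retriever side is not shown here.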