Vietnamese Natural Language Processing Resources

Create a pull request or open an issue to add your work to this list.

Large Language Models
Corpus
Text Processing Toolkit
Pre-trained Language Model
Sentiment Analysis
Named Entity Recognition
Speech Processing

Large Language Models

GemSUra: Pretrained Large Language Models based on Gemma built by URA (HCMUT).
Ghost-7b: A model fine-tuned from HuggingFaceH4/zephyr-7b-beta on a small synthetic dataset (about 200MB) that is 50% English and 50% Vietnamese.
PhoGPT: An open-source, state-of-the-art 7.5B-parameter generative model series for Vietnamese, which includes the base pre-trained monolingual model PhoGPT-7B5 and its instruction-following variant PhoGPT-7B5-Instruct.
Sailor: Sailor is a suite of Open Language Models tailored for South-East Asia (SEA), focusing on languages such as 🇮🇩Indonesian, 🇹🇭Thai, 🇻🇳Vietnamese, 🇲🇾Malay, and 🇱🇦Lao.
SeaLLM: The state-of-the-art multilingual LLM for Southeast Asian (SEA) languages 🇬🇧 🇨🇳 🇻🇳 🇮🇩 🇹🇭 🇲🇾 🇰🇭 🇱🇦 🇲🇲 🇵🇭.
ToRoLaMa: The Vietnamese Instruction-Following and Chat Model.
Vistral-7B-Chat-function-calling: This model was fine-tuned on Vistral-7B-chat for function calling.
Vistral-7B-Chat: Towards a State-of-the-Art Large Language Model for Vietnamese
ViGPTQA: LLMs for Vietnamese Question Answering
VBD-LLaMA2-Chat: A Conversationally-tuned LLaMA2 for Vietnamese.
Vietnamese LLaMA 2: A 7B version of LLaMA 2 with 140GB of Vietnamese text by BKAI Foundation Models Lab.
VinaLLaMA: Another collection of Vietnamese LLaMA-tuned models.
Vietcuna: A series of Vicuna tuned models for Vietnamese.
Llama2_vietnamese: A fine-tuned Large Language Model (LLM) for the Vietnamese language based on the Llama 2 model.
Vietnamese_LLMs: This project aims to create high-quality Vietnamese instruction datasets and tune several open-source large language models (LLMs). So far, they have released various models, including LLaMa and BLOOMZ. Additionally, they have released five instruction datasets, most of which were generated by GPT-4.

Corpus

For more recent additions, search HuggingFace for datasets that include Vietnamese: https://huggingface.co/datasets?language=language:vi&sort=trending

VN News Corpus: 50GB of uncompressed text crawled from a wide range of news websites and topics.
10000 Vietnamese Books: 10000 Vietnamese Books from 195x.
CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages
Bactrian-X: The Bactrian-X dataset is a collection of 3.4M instruction-response pairs in 52 languages.
OSCAR: 68GB of text data with 12,036,845,359 words.
Common Crawl: Open repository of web crawl data.
WikiDumps: You can download directly or use scripts from viwik18, viwik19.
Vietnamese Treebank: VLSP Project.
Vietnamese Stopwords: Vietnamese stopwords.
Vietnamese Dictionary: Vietnamese dictionary.
vietnamese-wordnet: Vietnamese wordnet.
VietnameseWAC: The dataset comprises a substantial collection of Vietnamese text, consisting of 129,781,089 tokens and 106,464,835 words, which have been automatically segmented and labeled as per Kilgarriff, A., and Le-Hong, P., 2012.
Vietlex Corpus: Vietlex's Vietnamese Corpus, a pioneering effort in Vietnam since 1998, contains about 80 million syllables from various sources.
Lexical Database of Vietnamese: A lexical database of Vietnamese contains various lexical information derived from two Vietnamese corpora.

Text Processing Toolkit

coccoc-tokenizer: High performance tokenizer for Vietnamese language. It is written in C++ with Python and Java bindings.
RDRSegmenter: Fast and accurate Vietnamese word segmenter (LREC 2018).
RDRPOSTagger: Fast and accurate POS and morphological tagging toolkit (EACL 2014).
VnCoreNLP: A Vietnamese natural language processing toolkit (NAACL 2018).
vlp-tok: Vietnamese text processing library developed in the Scala programming language.
ETNLP: A toolkit for Extraction, Evaluation and Visualization of Pre-trained Word Embeddings.
VietnameseTextNormalizer: Vietnamese Text Normalizer.
nnvlp: Neural network-based Vietnamese language processing toolkit.
jPTDP: Neural network models for joint POS tagging and dependency parsing (CoNLL 2017-2018).
vi_spacy: Vietnamese language model compatible with spaCy.
underthesea: Vietnamese NLP toolkit.
vnlp: GATE plugin for Vietnamese language processing.
pyvi: Python Vietnamese toolkit.
JVnTextPro: Java-based Vietnamese text processing tool.
DongDu: C++ implementation of Vietnamese word segmentation tool.
VLSP Toolkit: Vietnamese tokenizer from VLSP.
vTools: Vietnamese NLP toolkit: Tokenizer, Sentence detector, POS tagger, Phrase chunker.
JNSP: Java Implementation of Ngram Statistic Package.
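
Several of the toolkits above perform word segmentation: Vietnamese is written with spaces between syllables, so a segmenter has to decide which adjacent syllables form one word. As a rough illustration of the task only (not the method used by any tool listed here), the following sketch applies greedy longest matching against a tiny, hypothetical lexicon. The names `LEXICON`, `MAX_WORD_SYLLABLES`, and `segment` are made up for this example, and joining the syllables of a word with `_` is assumed here as a common output convention.

```python
import unicodedata


def _norm(text):
    """Normalize to NFC and lowercase so lexicon lookups are consistent."""
    return unicodedata.normalize("NFC", text).lower()


# Hypothetical toy lexicon of multi-syllable words (illustration only).
LEXICON = {_norm(w) for w in ("xử lý", "ngôn ngữ", "tự nhiên", "ngôn ngữ tự nhiên")}
MAX_WORD_SYLLABLES = 4


def segment(sentence, lexicon=LEXICON, max_len=MAX_WORD_SYLLABLES):
    """Greedy longest-match segmentation over whitespace-separated syllables."""
    syllables = sentence.split()
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, then shrink down to one syllable.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            # A single syllable is always accepted as a fallback.
            if n == 1 or _norm(candidate) in lexicon:
                words.append(candidate.replace(" ", "_"))
                i += n
                break
    return words


if __name__ == "__main__":
    print(segment("Xử lý ngôn ngữ tự nhiên"))
```

Greedy matching cannot resolve ambiguous syllable sequences or unseen words, which is where the trained segmenters in this list are expected to do better.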

Pre-trained Language Model

RoBERTa Vietnamese: Pre-trained embedding using RoBERTa architecture on Vietnamese corpus.
PhoBERT: Pre-trained language models for Vietnamese (another implementation of RoBERTa for Vietnamese).
ALBERT for Vietnamese: "A Lite" version of BERT for Vietnamese.
Vietnamese ELECTRA: ELECTRA pre-trained model using a Vietnamese corpus.
word2vecVN: Pre-trained Word2Vec models for Vietnamese.

Sentiment Analysis

Benchmark

VLSP 2016 Shared Task: Sentiment Analysis

Train: 5100 sentences (1700 positive, 1700 neutral, 1700 negative).

Test: 1050 sentences (350 positive, 350 neutral, 350 negative).

Model	F1	Paper	Code
Perceptron/SVM/Maxent	80.05	DSKTLAB: Vietnamese Sentiment Analysis for Product Reviews
SVM/MLNN/LSTM	71.44	A Simple Supervised Learning Approach to Sentiment Classification at VLSP 2016
Ensemble: Random forest, SVM, Naive Bayes	71.22	A Lightweight Ensemble Method for Sentiment Classification Task
Ensemble: SVM, LR, LSTM, CNN	69.71	An Ensemble of Shallow and Deep Learning Algorithms for Vietnamese Sentiment Analysis
SVM	67.54	Sentiment Analysis for Vietnamese using Support Vector Machines with application to Facebook comments
SVM/MLNN	67.23	A Multi-layer Neural Network-based System for Vietnamese Sentiment Analysis at the VLSP 2016 Evaluation Campaign
Multi-channel LSTM-CNN	59.61	Multi-channel LSTM-CNN model for Vietnamese sentiment analysis	official
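
The table above ranks systems by F1 on three balanced classes. The list does not state how the per-class scores are averaged, so the following is only a sketch of one common choice, macro-averaged F1, written in plain Python. The function names and the toy labels are hypothetical; multiply the result by 100 to compare with the scale used in the table.

```python
def per_class_f1(gold, pred, label):
    """F1 for a single class, computed from paired gold/predicted labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0  # no true positives: F1 is zero by convention
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred, labels=("positive", "neutral", "negative")):
    """Unweighted mean of per-class F1 (assumed averaging, not confirmed by the task)."""
    return sum(per_class_f1(gold, pred, label) for label in labels) / len(labels)


if __name__ == "__main__":
    gold = ["positive", "positive", "neutral", "neutral", "negative", "negative"]
    pred = ["positive", "neutral", "neutral", "neutral", "negative", "positive"]
    print(round(macro_f1(gold, pred), 4))
```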

VLSP 2018 Shared Task: Aspect Based Sentiment Analysis

Restaurant Dataset: 2961 reviews (train), 1290 reviews (development), 500 reviews (test).

Model	Aspect (F1)	Aspect Polarity (F1)	Paper
CNN	0.80		Deep Learning for Aspect Detection on Vietnamese Reviews
SVM	0.77	0.61	NLP@UIT at VLSP 2018: A Supervised Method For Aspect Based Sentiment Analysis
SVM	0.54	0.48	Using Multilayer Perceptron for Aspect-based Sentiment Analysis at VLSP 2018 SA Task

Hotel Dataset: 3000 reviews (training), 2000 reviews (development), 600 reviews (test).

Model	Aspect (F1)	Aspect Polarity (F1)	Paper
SVM	0.70	0.61	NLP@UIT at VLSP 2018: A Supervised Method For Aspect Based Sentiment Analysis
CNN	0.69		Deep Learning for Aspect Detection on Vietnamese Reviews
SVM	0.56	0.53	Using Multilayer Perceptron for Aspect-based Sentiment Analysis at VLSP 2018 SA Task

Vietnamese Student's Feedback Corpus (UIT-VSFC)

UIT-VSFC consists of over 16,000 sentences for sentiment analysis and topic classification.

Model	Sentiment (F1)	Topic (F1)	Paper	Code
Bi-LSTM/Word2Vec	0.896	0.92	Deep Learning versus Traditional Classifiers on Vietnamese Student’s Feedback Corpus
Maximum Entropy Classifier	0.88	0.84	UIT-VSFC: Vietnamese Student’s Feedback Corpus for Sentiment Analysis

Named Entity Recognition

Benchmark

VLSP 2016 Shared Task: Named Entity Recognition

Model	F1	Paper	Code
PhoBERT_large	94.7	PhoBERT: Pre-trained language models for Vietnamese	official
vELECTRA + BiLSTM + Attention	94.07	Improving Sequence Tagging for Vietnamese Text Using Transformer-based Neural Models
PhoBERT_base	93.6	PhoBERT: Pre-trained language models for Vietnamese	official
XLM-R	92.0	PhoBERT: Pre-trained language models for Vietnamese
VnCoreNLP-NER + ETNLP	91.3	ETNLP: A visual-aided systematic approach to select pre-trained embeddings for a downstream task
BiLSTM-CNN-CRF + ETNLP	91.1	ETNLP: A visual-aided systematic approach to select pre-trained embeddings for a downstream task
VNER: Attentive Neural Network	89.6	Attentive Neural Network for Named Entity Recognition in Vietnamese
BiLSTM-CNN-CRF	88.3	VnCoreNLP: A Vietnamese Natural Language Processing Toolkit	official
LSTM + CRF	66.07	An investigation of Vietnamese Nested Entity Recognition Models

VLSP 2018 Shared Task: Named Entity Recognition

Model	F1	Paper
vELECTRA + BiGRU	90.31	Improving Sequence Tagging for Vietnamese Text Using Transformer-based Neural Models
VIETNER: CRF (ngrams + word shapes + cluster + w2v)	76.63	A Feature-Based Model for Nested Named-Entity Recognition at VLSP-2018 NER Evaluation Campaign
ZA-NER	74.70	ZA-NER: Vietnamese Named Entity Recognition at VLSP 2018 Evaluation Campaign
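
NER results such as those in the tables above are usually scored at the entity level: a predicted entity counts only if both its span and its type match the gold annotation exactly. The sketch below assumes flat (non-nested) BIO tags for a single sentence and is not the official VLSP scorer; `extract_entities` and `entity_f1` are hypothetical helper names, and the nested entities in the VLSP data would need a different representation.

```python
def extract_entities(tags):
    """Return the set of (start, end, type) spans in a BIO tag sequence."""
    entities = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any entity still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            entities.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return entities


def entity_f1(gold_tags, pred_tags):
    """Exact-match entity-level F1 between two BIO tag sequences."""
    gold = extract_entities(gold_tags)
    pred = extract_entities(pred_tags)
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    gold = ["B-PER", "I-PER", "O", "B-LOC", "I-LOC"]
    pred = ["B-PER", "I-PER", "O", "B-LOC", "O"]
    print(entity_f1(gold, pred))
```

In this toy case the person entity matches but the location span is cut short, so one of two entities is correct on each side.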

Speech Processing

Corpus:

VLSP 2020 - ASR challenge - training set: announcement, unofficial mirror link on huggingface
VIVOS: official link, mirror link on huggingface
Bud500: announcement, mirror link on huggingface
FOSD (FPT open speech dataset): official link, unofficial mirror link on huggingface
LSVSC (Large-scale Vietnamese speech corpus): announcement, unofficial mirror link on huggingface
Infore: official link, unofficial mirror link for dataset 1 on huggingface, unofficial mirror link for dataset 2 on huggingface
unofficial mirror link Vivos + InfoRe 1 + InfoRe 2
VietTTS-v1: A synthesized dataset for Vietnamese TTS task (35.1 hrs)
Mozilla CommonVoice
Google FLEURS

Project

vietTTS: Tacotron + HiFiGAN vocoder for Vietnamese datasets.

quyen88/awsome-vietnamese-nlp