Vietnamese Natural Language Processing Resources

Create a pull request to add your work to this list.

VN News Corpus: 50GB of uncompressed text crawled from a wide range of news websites and topics.
VNESEcorpus: 650,000 sentences from vietnamnet.vn, dantri.com.vn, nhandan.com.vn.
VNTQcorpus(small): 300,000 sentences from vnthuquan.net.
VNTQcorpus(big): 1,750,000 sentences from vnthuquan.net.
OSCAR: 68GB of text data with 12,036,845,359 words.
Common Crawl: Open repository of web crawl data.
WikiDumps: You can download directly or use scripts from viwik18, viwik19.
Vietnamese Treebank: VLSP Project.
Vietnamese Stopwords: Vietnamese stopwords.
Vietnamese Dictionary: Vietnamese dictionary.
vietnamese-wordnet: Vietnamese wordnet.

coccoc-tokenizer: High performance tokenizer for Vietnamese language. It is written in C++ with Python and Java bindings.
RDRSegmenter: Fast and accurate Vietnamese word segmenter (LREC 2018).
RDRPOSTagger: Fast and accurate POS and morphological tagging toolkit (EACL 2014).
VnCoreNLP: A Vietnamese natural language processing toolkit (NAACL 2018).
vlp-tok: Vietnamese text processing library developed in the Scala programming language.
ETNLP: A toolkit for Extraction, Evaluation and Visualization of Pre-trained Word Embeddings.
VietnameseTextNormalizer: Vietnamese Text Normalizer.
nnvlp: Neural network-based Vietnamese language processing toolkit.
jPTDP: Neural network models for joint POS tagging and dependency parsing (CoNLL 2017-2018).
vi_spacy: Vietnamese language model compatible with spaCy.
underthesea: Vietnamese NLP toolkit.
vnlp: GATE plugin for Vietnamese language processing.
pyvi: Python Vietnamese toolkit.
JVnTextPro: Java-based Vietnamese text processing tool.
DongDu: C++ implementation of Vietnamese word segmentation tool.
VLSP Toolkit: Vietnamese tokenizer from VLSP.
vTools: Vietnamese NLP toolkit: Tokenizer, Sentence detector, POS tagger, Phrase chunker.
JNSP: Java Implementation of Ngram Statistic Package.

RoBERTa Vietnamese: Pre-trained embedding using RoBERTa architecture on Vietnamese corpus.
PhoBERT: Pre-trained language models for Vietnamese (another implementation of RoBERTa for Vietnamese).
ALBERT for Vietnamese: "A Lite" version of BERT for Vietnamese.
Vietnamese ELECTRA: ELECTRA model pre-trained on a Vietnamese corpus.
word2vecVN: Pre-trained Word2Vec models for Vietnamese.

Test: 1050 sentences (350 positive, 350 neutral, 350 negative).

Model	F1	Paper	Code
Perceptron/SVM/Maxent	80.05	DSKTLAB: Vietnamese Sentiment Analysis for Product Reviews
SVM/MLNN/LSTM	71.44	A Simple Supervised Learning Approach to Sentiment Classification at VLSP 2016
Ensemble: Random forest, SVM, Naive Bayes	71.22	A Lightweight Ensemble Method for Sentiment Classification Task
Ensemble: SVM, LR, LSTM, CNN	69.71	An Ensemble of Shallow and Deep Learning Algorithms for Vietnamese Sentiment Analysis
SVM	67.54	Sentiment Analysis for Vietnamese using Support Vector Machines with application to Facebook comments
SVM/MLNN	67.23	A Multi-layer Neural Network-based System for Vietnamese Sentiment Analysis at the VLSP 2016 Evaluation Campaign
Multi-channel LSTM-CNN	59.61	Multi-channel LSTM-CNN model for Vietnamese sentiment analysis	official
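
The F1 column above summarizes three-class performance in a single number. As a rough illustration of how such a score can be computed, the sketch below calculates a macro-averaged F1 (the unweighted mean of per-class F1). Macro averaging is an assumption here, since the list does not state which averaging the papers use, and the function name `macro_f1` and the toy labels are invented for the example.

```python
from collections import Counter


def macro_f1(gold, pred, labels):
    """Return the unweighted mean of per-label F1 scores (assumed averaging)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # gold label g was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(labels)


# Toy data, invented for illustration only (not taken from any benchmark)
gold = ["positive", "neutral", "negative", "positive", "neutral", "negative"]
pred = ["positive", "neutral", "negative", "neutral", "neutral", "positive"]
score = macro_f1(gold, pred, ["positive", "neutral", "negative"])
print(f"macro F1 = {score:.4f}")
```

On a balanced test set such as the one described above, macro averaging gives each of the three classes equal weight, which is one reason it is a common choice for this kind of comparison.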

Restaurant Dataset: 2961 reviews (training), 1290 reviews (development), 500 reviews (test).

Model	Aspect (F1)	Aspect Polarity (F1)	Paper
CNN	0.80		Deep Learning for Aspect Detection on Vietnamese Reviews
SVM	0.77	0.61	NLP@UIT at VLSP 2018: A Supervised Method For Aspect Based Sentiment Analysis
SVM	0.54	0.48	Using Multilayer Perceptron for Aspect-based Sentiment Analysis at VLSP 2018 SA Task

Hotel Dataset: 3000 reviews (training), 2000 reviews (development), 600 reviews (test).

Model	Aspect (F1)	Aspect Polarity (F1)	Paper
SVM	0.70	0.61	NLP@UIT at VLSP 2018: A Supervised Method For Aspect Based Sentiment Analysis
CNN	0.69		Deep Learning for Aspect Detection on Vietnamese Reviews
SVM	0.56	0.53	Using Multilayer Perceptron for Aspect-based Sentiment Analysis at VLSP 2018 SA Task

UIT-VSFC consists of over 16,000 sentences for sentiment analysis and topic classification.

Model	Sentiment (F1)	Topic (F1)	Paper	Code
Bi-LSTM/Word2Vec	0.896	0.92	Deep Learning versus Traditional Classifiers on Vietnamese Student’s Feedback Corpus
Maximum Entropy Classifier	0.88	0.84	UIT-VSFC: Vietnamese Student’s Feedback Corpus for Sentiment Analysis

Model	F1	Paper	Code
PhoBERT_large	94.7	PhoBERT: Pre-trained language models for Vietnamese	official
vELECTRA + BiLSTM + Attention	94.07	Improving Sequence Tagging for Vietnamese Text Using Transformer-based Neural Models
PhoBERT_base	93.6	PhoBERT: Pre-trained language models for Vietnamese	official
XLM-R	92.0	PhoBERT: Pre-trained language models for Vietnamese
VnCoreNLP-NER + ETNLP	91.3	ETNLP: A visual-aided systematic approach to select pre-trained embeddings for a downstream task
BiLSTM-CNN-CRF + ETNLP	91.1	ETNLP: A visual-aided systematic approach to select pre-trained embeddings for a downstream task
VNER: Attentive Neural Network	89.6	Attentive Neural Network for Named Entity Recognition in Vietnamese
BiLSTM-CNN-CRF	88.3	VnCoreNLP: A Vietnamese Natural Language Processing Toolkit	official
LSTM + CRF	66.07	An investigation of Vietnamese Nested Entity Recognition Models
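
NER results like those above are usually reported as entity-level F1, where a predicted entity counts as correct only if both its type and its exact boundaries match the gold annotation. The sketch below shows one way to compute that from BIO-tagged sequences. It assumes BIO tagging, exact-match scoring, and flat (non-nested) entities, none of which the list itself specifies; the helper names `bio_spans` and `entity_f1` and the toy tags are hypothetical.

```python
def bio_spans(tags):
    """Extract (type, start, end) spans from a BIO sequence; end is exclusive."""
    spans = set()
    start, kind = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        inside = tag.startswith("I-") and kind == tag[2:]
        if inside:
            continue  # current span keeps growing
        if kind is not None:
            spans.add((kind, start, i))
        if tag == "O":
            start, kind = None, None
        else:
            # "B-X" opens a span; a stray "I-X" is treated as opening one too
            start, kind = i, tag[2:]
    return spans


def entity_f1(gold_tags, pred_tags):
    """Exact-match entity-level F1 between two BIO tag sequences."""
    gold, pred = bio_spans(gold_tags), bio_spans(pred_tags)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# Toy tags, invented for illustration only
gold = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-ORG", "I-ORG"]
pred = ["B-PER", "I-PER", "O", "B-LOC", "O", "B-ORG", "O"]
f1 = entity_f1(gold, pred)
print(f"entity-level F1 = {f1:.4f}")
```

In the toy example the ORG entity is predicted with the wrong right boundary, so it counts as both a false positive and a false negative. Nested entities would need a span-based representation instead of a single tag per token.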

Model	F1	Paper
vELECTRA + BiGRU	90.31	Improving Sequence Tagging for Vietnamese Text Using Transformer-based Neural Models
VIETNER: CRF (ngrams + word shapes + cluster + w2v)	76.63	A Feature-Based Model for Nested Named-Entity Recognition at VLSP-2018 NER Evaluation Campaign
ZA-NER	74.70	ZA-NER: Vietnamese Named Entity Recognition at VLSP 2018 Evaluation Campaign

trungngonptit/awsome-vietnamese-nlp