/awsome-vietnamese-nlp

A collection of Vietnamese Natural Language Processing resources.

Vietnamese Natural Language Processing Resources

Create a pull request to add your work to this list.

Corpus

Text Processing Toolkit

  • coccoc-tokenizer: High-performance tokenizer for the Vietnamese language. It is written in C++ with Python and Java bindings.
  • RDRSegmenter: Fast and accurate Vietnamese word segmenter (LREC 2018).
  • RDRPOSTagger: Fast and accurate POS and morphological tagging toolkit (EACL 2014).
  • VnCoreNLP: A Vietnamese natural language processing toolkit (NAACL 2018).
  • vlp-tok: Vietnamese text processing library developed in the Scala programming language.
  • ETNLP: A toolkit for Extraction, Evaluation and Visualization of Pre-trained Word Embeddings.
  • VietnameseTextNormalizer: Vietnamese Text Normalizer.
  • nnvlp: Neural network-based Vietnamese language processing toolkit.
  • jPTDP: Neural network models for joint POS tagging and dependency parsing (CoNLL 2017-2018).
  • vi_spacy: Vietnamese language model compatible with spaCy.
  • underthesea: Underthesea - Vietnamese NLP toolkit.
  • vnlp: GATE plugin for Vietnamese language processing.
  • pyvi: Python Vietnamese toolkit.
  • JVnTextPro: Java-based Vietnamese text processing tool.
  • DongDu: C++ implementation of Vietnamese word segmentation tool.
  • VLSP Toolkit: Vietnamese tokenizer from VLSP.
  • vTools: Vietnamese NLP toolkit: Tokenizer, Sentence detector, POS tagger, Phrase chunker.
  • JNSP: Java implementation of the Ngram Statistics Package.

Pre-trained Language Model

  • RoBERTa Vietnamese: Pre-trained embeddings using the RoBERTa architecture, trained on a Vietnamese corpus.
  • PhoBERT: Pre-trained language models for Vietnamese (another implementation of RoBERTa for Vietnamese).
  • ALBERT for Vietnamese: "A Lite" version of BERT for Vietnamese.
  • Vietnamese ELECTRA: ELECTRA model pre-trained on a Vietnamese corpus.
  • word2vecVN: Pre-trained Word2Vec models for Vietnamese.

Sentiment Analysis

Benchmark

  • VLSP 2016 Shared Task: Sentiment Analysis

    • Train: 5100 sentences (1700 positive, 1700 neutral, 1700 negative).

    • Test: 1050 sentences (350 positive, 350 neutral, 350 negative).

      | Model | F1 | Paper | Code |
      | Perceptron/SVM/Maxent | 80.05 | DSKTLAB: Vietnamese Sentiment Analysis for Product Reviews | |
      | SVM/MLNN/LSTM | 71.44 | A Simple Supervised Learning Approach to Sentiment Classification at VLSP 2016 | |
      | Ensemble: Random forest, SVM, Naive Bayes | 71.22 | A Lightweight Ensemble Method for Sentiment Classification Task | |
      | Ensemble: SVM, LR, LSTM, CNN | 69.71 | An Ensemble of Shallow and Deep Learning Algorithms for Vietnamese Sentiment Analysis | |
      | SVM | 67.54 | Sentiment Analysis for Vietnamese using Support Vector Machines with application to Facebook comments | |
      | SVM/MLNN | 67.23 | A Multi-layer Neural Network-based System for Vietnamese Sentiment Analysis at the VLSP 2016 Evaluation Campaign | |
      | Multi-channel LSTM-CNN | 59.61 | Multi-channel LSTM-CNN model for Vietnamese sentiment analysis | official |
  • VLSP 2018 Shared Task: Aspect Based Sentiment Analysis

    • Restaurant Dataset: 2961 reviews (training), 1290 reviews (development), 500 reviews (test).

      | Model | Aspect (F1) | Aspect Polarity (F1) | Paper | Code |
      | CNN | 0.80 | | Deep Learning for Aspect Detection on Vietnamese Reviews | |
      | SVM | 0.77 | 0.61 | NLP@UIT at VLSP 2018: A Supervised Method For Aspect Based Sentiment Analysis | |
      | SVM | 0.54 | 0.48 | Using Multilayer Perceptron for Aspect-based Sentiment Analysis at VLSP 2018 SA Task | |
    • Hotel Dataset: 3000 reviews (training), 2000 reviews (development), 600 reviews (test).

      | Model | Aspect (F1) | Aspect Polarity (F1) | Paper | Code |
      | SVM | 0.70 | 0.61 | NLP@UIT at VLSP 2018: A Supervised Method For Aspect Based Sentiment Analysis | |
      | CNN | 0.69 | | Deep Learning for Aspect Detection on Vietnamese Reviews | |
      | SVM | 0.56 | 0.53 | Using Multilayer Perceptron for Aspect-based Sentiment Analysis at VLSP 2018 SA Task | |
  • Vietnamese Student's Feedback Corpus (UIT-VSFC)

    • UIT-VSFC consists of over 16,000 sentences for sentiment analysis and topic classification.

      | Model | Sentiment (F1) | Topic (F1) | Paper | Code |
      | Bi-LSTM/Word2Vec | 0.896 | 0.92 | Deep Learning versus Traditional Classifiers on Vietnamese Student’s Feedback Corpus | |
      | Maximum Entropy Classifier | 0.88 | 0.84 | UIT-VSFC: Vietnamese Student’s Feedback Corpus for Sentiment Analysis | |
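
The F1 columns in the tables above do not state which averaging was used, and the scales differ (percentages in some tables, fractions in others). As a rough illustration only, here is a minimal sketch that assumes macro-averaged F1 over the three sentiment classes. The function name `macro_f1` and the toy labels are hypothetical and are not taken from any of the listed papers.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Macro-averaged F1: the unweighted mean of per-class F1 scores.

    Assumption: macro averaging. The benchmark tables do not say which
    averaging the reported scores use.
    """
    labels = sorted(set(gold) | set(pred))
    if not labels:
        return 0.0
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1   # correct prediction for class g
        else:
            fp[p] += 1   # predicted class p, but it was wrong
            fn[g] += 1   # gold class g was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with hypothetical labels (not benchmark data).
gold = ["positive", "neutral", "negative", "positive", "negative", "neutral"]
pred = ["positive", "neutral", "positive", "positive", "negative", "negative"]
score = macro_f1(gold, pred)
print(f"macro F1 = {score:.4f}")
```

To compare a score computed this way with the percentage-style tables, multiply it by 100.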

Named Entity Recognition

Benchmark

  • VLSP 2016 Shared Task: Named Entity Recognition

    | Model | F1 | Paper | Code |
    | PhoBERT_large | 94.7 | PhoBERT: Pre-trained language models for Vietnamese | official |
    | vELECTRA + BiLSTM + Attention | 94.07 | Improving Sequence Tagging for Vietnamese Text Using Transformer-based Neural Models | |
    | PhoBERT_base | 93.6 | PhoBERT: Pre-trained language models for Vietnamese | official |
    | XLM-R | 92.0 | PhoBERT: Pre-trained language models for Vietnamese | |
    | VnCoreNLP-NER + ETNLP | 91.3 | ETNLP: A visual-aided systematic approach to select pre-trained embeddings for a downstream task | |
    | BiLSTM-CNN-CRF + ETNLP | 91.1 | ETNLP: A visual-aided systematic approach to select pre-trained embeddings for a downstream task | |
    | VNER: Attentive Neural Network | 89.6 | Attentive Neural Network for Named Entity Recognition in Vietnamese | |
    | BiLSTM-CNN-CRF | 88.3 | VnCoreNLP: A Vietnamese Natural Language Processing Toolkit | official |
    | LSTM + CRF | 66.07 | An investigation of Vietnamese Nested Entity Recognition Models | |
  • VLSP 2018 Shared Task: Named Entity Recognition

    | Model | F1 | Paper | Code |
    | vELECTRA + BiGRU | 90.31 | Improving Sequence Tagging for Vietnamese Text Using Transformer-based Neural Models | |
    | VIETNER: CRF (ngrams + word shapes + cluster + w2v) | 76.63 | A Feature-Based Model for Nested Named-Entity Recognition at VLSP-2018 NER Evaluation Campaign | |
    | ZA-NER | 74.70 | ZA-NER: Vietnamese Named Entity Recognition at VLSP 2018 Evaluation Campaign | |

Speech Processing

Corpus

Project

  • vietTTS: Tacotron + HiFiGAN vocoder for Vietnamese datasets.