VnCoreNLP is a Java NLP annotation pipeline for Vietnamese, providing rich linguistic annotations through key NLP components of word segmentation, POS tagging, named entity recognition (NER) and dependency parsing:
- ACCURATE – VnCoreNLP is the most accurate toolkit for Vietnamese NLP, obtaining state-of-the-art results on standard benchmark datasets.
- FAST – VnCoreNLP is fast enough to process large-scale data.
- EASY-TO-USE – Users do not have to install external dependencies and can run processing pipelines from either the command line or the Java API.
The general architecture and experimental results of VnCoreNLP can be found in the following related papers:
1. Thanh Vu, Dat Quoc Nguyen, Dai Quoc Nguyen, Mark Dras and Mark Johnson. 2018. VnCoreNLP: A Vietnamese Natural Language Processing Toolkit. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, NAACL 2018, pages 56-60.
2. Dat Quoc Nguyen, Dai Quoc Nguyen, Thanh Vu, Mark Dras and Mark Johnson. 2018. A Fast and Accurate Vietnamese Word Segmenter. In Proceedings of the 11th International Conference on Language Resources and Evaluation, LREC 2018, pages 2582-2587.
3. Dat Quoc Nguyen, Thanh Vu, Dai Quoc Nguyen, Mark Dras and Mark Johnson. 2017. From Word Segmentation to POS Tagging for Vietnamese. In Proceedings of the 15th Annual Workshop of the Australasian Language Technology Association, ALTA 2017, pages 108-113.
Please CITE paper [1] whenever VnCoreNLP is used to produce published results or incorporated into other software. If you are dealing in depth with either word segmentation or POS tagging, you are encouraged to also cite paper [2] or [3], respectively.
NOTE that if you are looking for light-weight versions, VnCoreNLP's word segmentation and POS tagging components have also been released as independent packages RDRsegmenter [2] and VnMarMoT [3], respectively.
VnCoreNLP is free for non-commercial use and distributed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA) License.
Assume that Java 1.8+ is already set to run in the command line or terminal (for example, by adding Java to the environment variable path in Windows OS), and that the file VnCoreNLP-1.0.1.jar (27MB) and the folder models (113MB) are placed in the same working folder. You can then run VnCoreNLP to annotate an input raw text corpus (e.g. a collection of news content) using the following commands:
// To perform word segmentation, POS tagging, NER and then dependency parsing
$ java -Xmx2g -jar VnCoreNLP-1.0.1.jar -fin input.txt -fout output.txt
// To perform word segmentation, POS tagging and then NER
$ java -Xmx2g -jar VnCoreNLP-1.0.1.jar -fin input.txt -fout output.txt -annotators wseg,pos,ner
// To perform word segmentation and then POS tagging
$ java -Xmx2g -jar VnCoreNLP-1.0.1.jar -fin input.txt -fout output.txt -annotators wseg,pos
// To perform word segmentation
$ java -Xmx2g -jar VnCoreNLP-1.0.1.jar -fin input.txt -fout output.txt -annotators wseg
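When the command-line pipeline is launched from another Java program, the invocations above can be assembled programmatically. The sketch below only builds the argument list from the options already shown (-Xmx2g, -jar, -fin, -fout, -annotators). The class and method names (VnCoreNLPCommand, build) are hypothetical helpers for illustration and are not part of VnCoreNLP.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of VnCoreNLP) that assembles the command lines shown above. */
final class VnCoreNLPCommand {

    /**
     * Builds the argument list for one run. Passing no annotators omits the
     * -annotators option, which matches the first command above (full pipeline).
     */
    static List<String> build(String inputFile, String outputFile, String... annotators) {
        List<String> command = new ArrayList<>();
        command.add("java");
        command.add("-Xmx2g");
        command.add("-jar");
        command.add("VnCoreNLP-1.0.1.jar");
        command.add("-fin");
        command.add(inputFile);
        command.add("-fout");
        command.add(outputFile);
        if (annotators.length > 0) {
            command.add("-annotators");
            command.add(String.join(",", annotators));
        }
        return command;
    }

    public static void main(String[] args) {
        // Only prints the command; nothing is executed here.
        List<String> command = build("input.txt", "output.txt", "wseg", "pos", "ner");
        System.out.println(String.join(" ", command));
    }
}
```

The returned list could be handed to java.lang.ProcessBuilder once the jar file and the models folder are in the working folder.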
The following code is a simple and complete example of using the Java API:
import vn.pipeline.*;
import java.io.*;

public class VnCoreNLPExample {
    public static void main(String[] args) throws IOException {
        // "wseg", "pos", "ner", and "parse" refer to word segmentation, POS tagging, NER and dependency parsing, respectively.
        String[] annotators = {"wseg", "pos", "ner", "parse"};
        VnCoreNLP pipeline = new VnCoreNLP(annotators);

        String str = "Ông Nguyễn Khắc Chúc đang làm việc tại Đại học Quốc gia Hà Nội. Bà Lan, vợ ông Chúc, cũng làm việc tại đây.";
        Annotation annotation = new Annotation(str);
        pipeline.annotate(annotation);

        // Print all annotations, one word per line
        System.out.println(annotation.toString());
        // 1 Ông Nc O 4 sub
        // 2 Nguyễn_Khắc_Chúc Np B-PER 1 nmod
        // 3 đang R O 4 adv
        // 4 làm_việc V O 0 root
        // ...

        // Write to file
        PrintStream outputPrinter = new PrintStream("output.txt");
        pipeline.printToFile(annotation, outputPrinter);

        // You can also get a single sentence to analyze individually
        Sentence firstSentence = annotation.getSentences().get(0);
        System.out.println(firstSentence.toString());
    }
}
See VnCoreNLP's open-source code in the folder src for API details.
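For downstream processing, the column output illustrated in the example above can be read back into simple objects. The sketch below assumes the full pipeline was run, so that each line has exactly six whitespace-separated columns (index, word form, POS tag, NER label, head index, dependency label), as in the sample output. The class AnnotatedTokenReader and its members are hypothetical and are not part of the VnCoreNLP API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical reader for the column output shown above; not part of the VnCoreNLP API. */
final class AnnotatedTokenReader {

    /** One output line: index, word form, POS tag, NER label, head index, dependency label. */
    static final class Token {
        final int index;
        final String form;
        final String posTag;
        final String nerLabel;
        final int head;
        final String depLabel;

        Token(int index, String form, String posTag, String nerLabel, int head, String depLabel) {
            this.index = index;
            this.form = form;
            this.posTag = posTag;
            this.nerLabel = nerLabel;
            this.head = head;
            this.depLabel = depLabel;
        }
    }

    /** Parses one line, assuming exactly six whitespace-separated columns. */
    static Token parseLine(String line) {
        String[] columns = line.trim().split("\\s+");
        if (columns.length != 6) {
            throw new IllegalArgumentException("Expected 6 columns but found " + columns.length + ": " + line);
        }
        return new Token(Integer.parseInt(columns[0]), columns[1], columns[2], columns[3],
                Integer.parseInt(columns[4]), columns[5]);
    }

    /** Parses the lines of one sentence, skipping blank lines. */
    static List<Token> parseSentence(List<String> lines) {
        List<Token> tokens = new ArrayList<>();
        for (String line : lines) {
            if (!line.trim().isEmpty()) {
                tokens.add(parseLine(line));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "1 Ông Nc O 4 sub",
                "2 Nguyễn_Khắc_Chúc Np B-PER 1 nmod",
                "3 đang R O 4 adv",
                "4 làm_việc V O 0 root");
        // Print only the words that carry a named-entity label.
        for (Token token : parseSentence(lines)) {
            if (!token.nerLabel.equals("O")) {
                System.out.println(token.form + "\t" + token.nerLabel);
            }
        }
    }
}
```

Output produced with fewer annotators (for example wseg,pos only) has fewer columns, so the column check would need to be adapted.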
We briefly present experimental setups and obtained results in the following subsections. See details in papers [1,2,3] above.
- Training data: 75k manually word-segmented training sentences from the VLSP 2013 word segmentation shared task.
- Test data: 2120 test sentences from the VLSP 2013 POS tagging shared task.
Model | F1 (%) | Speed (words/second) |
---|---|---|
VnCoreNLP (i.e. RDRsegmenter) | 97.90 | 62k / _ |
UETsegmenter | 97.87 | 48k / 33k* |
vnTokenizer | 97.33 | _ / 5k* |
JVnSegmenter-Maxent | 97.00 | _ / 1k* |
JVnSegmenter-CRFs | 97.06 | _ / 1k* |
DongDu | 96.90 | _ / 17k* |
- Speed is computed on a personal computer with an Intel Core i7 2.2 GHz CPU, except where specifically noted. * denotes that the speed is computed on a personal computer with an Intel Core i5 1.80 GHz CPU.
- See paper [2] for more details.
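As an illustration of how a word-level F1 score for segmentation is commonly computed, the sketch below maps each word to the span of syllables it covers and counts a predicted word as correct only if its span matches a gold span exactly. This is an assumed, standard formulation and not the evaluation script used for the results above. The class name SegmentationF1 is hypothetical, and words are assumed to join their syllables with underscores as in VnCoreNLP's output.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical word-level F1 scorer for one sentence; not the official evaluation script. */
final class SegmentationF1 {

    /** Maps each word to a "start-end" syllable span; syllables inside a word are joined by '_'. */
    static Set<String> spans(List<String> words) {
        Set<String> result = new HashSet<>();
        int start = 0;
        for (String word : words) {
            // A token made only of underscores splits to nothing, so count it as one syllable.
            int length = Math.max(1, word.split("_").length);
            result.add(start + "-" + (start + length));
            start += length;
        }
        return result;
    }

    /** Returns F1 in the range [0, 1] for one gold and one predicted segmentation. */
    static double f1(List<String> gold, List<String> predicted) {
        Set<String> goldSpans = spans(gold);
        Set<String> predictedSpans = spans(predicted);
        Set<String> correct = new HashSet<>(goldSpans);
        correct.retainAll(predictedSpans);
        if (correct.isEmpty()) {
            return 0.0;
        }
        double precision = (double) correct.size() / predictedSpans.size();
        double recall = (double) correct.size() / goldSpans.size();
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        List<String> gold = Arrays.asList("Ông", "Nguyễn_Khắc_Chúc", "đang", "làm_việc");
        List<String> predicted = Arrays.asList("Ông", "Nguyễn_Khắc", "Chúc", "đang", "làm_việc");
        System.out.println(f1(gold, predicted));
    }
}
```

In this example three of the five predicted words match the four gold words, so precision is 3/5, recall is 3/4 and F1 is 2/3. A corpus-level score would accumulate the counts over all sentences before computing precision and recall.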
- 27,870 sentences for training and development from the VLSP 2013 POS tagging shared task:
- 27k sentences are used for training.
- 870 sentences are used for development.
- Test data: 2120 test sentences from the VLSP 2013 POS tagging shared task.
Model | Accuracy (%) | Speed |
---|---|---|
VnCoreNLP (i.e. VnMarMoT) | 95.88 | 25k |
RDRPOSTagger | 95.11 | 180k |
BiLSTM-CRF | 95.06 | 3k |
BiLSTM-CRF + CNN-char | 95.40 | 2.5k |
BiLSTM-CRF + LSTM-char | 95.31 | 1.5k |
- See paper [3] for more details.
- 16,861 sentences for training and development from the VLSP 2016 NER shared task:
- 14,861 sentences are used for training.
- 2k sentences are used for development.
- Test data: 2,831 test sentences from the VLSP 2016 NER shared task.
- NOTE that the original VLSP 2016 NER data also contains gold POS and chunking tags, as reconfirmed by the VLSP 2016 organizers. Also, in the VLSP 2016 NER data, each word representing a full personal name is separated into the syllables that constitute the word. This scheme results in an unrealistic scenario for a pipeline evaluation:
- Gold POS and chunking tags are NOT available in a real-world application.
- The standard annotation for Vietnamese word segmentation and POS tagging forms each full name as a word token, thus all word segmenters have been trained to output a full name as a word and all POS taggers have been trained to assign a POS label to the entire full-name.
- For a realistic scenario, we merge the contiguous syllables constituting a full name to form a word. Then, to obtain predicted POS tags for the training, development and test sentences, we perform POS tagging using our tagging component. The results are as follows:
Model | F1 (%) | Speed |
---|---|---|
VnCoreNLP | 88.55 | 18k |
BiLSTM-CRF | 86.48 | 2.8k |
BiLSTM-CRF + CNN-char | 88.28 | 1.8k |
BiLSTM-CRF + LSTM-char | 87.71 | 1.3k |
BiLSTM-CRF + predicted POS | 86.12 | _ |
BiLSTM-CRF + CNN-char + predicted POS | 88.06 | _ |
BiLSTM-CRF + LSTM-char + predicted POS | 87.43 | _ |
- Here, the speed reported for VnCoreNLP includes the time taken by POS tagging.
- See paper [1] for more details.
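The merging of name syllables described above can be sketched as follows. The sketch assumes each syllable carries a BIO-style label, with B-PER on the first syllable of a personal name and I-PER on the following ones, and joins them with underscores into a single word labeled B-PER. The exact preprocessing used for the reported results may differ, and the class name FullNameMerger is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of merging the syllables of a full personal name into one word. */
final class FullNameMerger {

    /**
     * Merges each B-PER syllable with the I-PER syllables that directly follow it.
     * Returns {word, label} pairs; all other syllables are kept unchanged,
     * including an I-PER syllable that has no preceding B-PER.
     */
    static List<String[]> merge(List<String> syllables, List<String> labels) {
        if (syllables.size() != labels.size()) {
            throw new IllegalArgumentException("Syllables and labels must have the same length");
        }
        List<String[]> merged = new ArrayList<>();
        int i = 0;
        while (i < syllables.size()) {
            String label = labels.get(i);
            StringBuilder word = new StringBuilder(syllables.get(i));
            i++;
            if (label.equals("B-PER")) {
                // Absorb the continuation syllables of the same name.
                while (i < syllables.size() && labels.get(i).equals("I-PER")) {
                    word.append('_').append(syllables.get(i));
                    i++;
                }
            }
            merged.add(new String[] {word.toString(), label});
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> syllables = Arrays.asList("Ông", "Nguyễn", "Khắc", "Chúc", "đang");
        List<String> labels = Arrays.asList("O", "B-PER", "I-PER", "I-PER", "O");
        for (String[] token : merge(syllables, labels)) {
            System.out.println(token[0] + "\t" + token[1]);
        }
    }
}
```

After merging, the name is a single word token, which matches the input that word segmenters and POS taggers for Vietnamese are trained on.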
- We use the Vietnamese dependency treebank VnDT consisting of 10,200 sentences. We use the last 1020 sentences of VnDT for test while the remaining sentences are used for training.
POS tags | Model | LAS (%) | UAS (%) | Speed |
---|---|---|---|---|
Gold POS | VnCoreNLP | 73.39 | 79.02 | _ |
Gold POS | BIST-bmstparser | 73.17 | 79.39 | _ |
Gold POS | BIST-barchybrid | 72.53 | 79.33 | _ |
Gold POS | MSTparser | 70.29 | 76.47 | _ |
Gold POS | MaltParser | 69.10 | 74.91 | _ |
Predicted POS | VnCoreNLP | 70.23 | 76.93 | 8k |
Predicted POS | jPTDP | 69.49 | 77.68 | 700 |
- See paper [1] for more details.
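For readers unfamiliar with the two parsing metrics, the sketch below shows how they are usually defined: UAS is the percentage of words whose predicted head is correct, and LAS additionally requires the dependency label to be correct. This is a minimal illustration under those standard definitions and may differ from the evaluation used for the table above (for example, in how punctuation is treated). The class name AttachmentScores is hypothetical.

```java
/** Hypothetical scorer illustrating the usual definitions of UAS and LAS. */
final class AttachmentScores {

    /** Returns {UAS, LAS} in percent for parallel arrays of gold and predicted heads and labels. */
    static double[] score(int[] goldHeads, String[] goldLabels,
                          int[] predictedHeads, String[] predictedLabels) {
        int n = goldHeads.length;
        if (n == 0 || goldLabels.length != n || predictedHeads.length != n || predictedLabels.length != n) {
            throw new IllegalArgumentException("Inputs must be non-empty and of equal length");
        }
        int correctHeads = 0;
        int correctHeadsAndLabels = 0;
        for (int i = 0; i < n; i++) {
            if (goldHeads[i] == predictedHeads[i]) {
                correctHeads++;
                // A labeled attachment counts only when the head is also correct.
                if (goldLabels[i].equals(predictedLabels[i])) {
                    correctHeadsAndLabels++;
                }
            }
        }
        return new double[] {100.0 * correctHeads / n, 100.0 * correctHeadsAndLabels / n};
    }

    public static void main(String[] args) {
        int[] goldHeads = {4, 1, 4, 0};
        String[] goldLabels = {"sub", "nmod", "adv", "root"};
        int[] predictedHeads = {4, 1, 4, 0};
        String[] predictedLabels = {"sub", "nmod", "vmod", "root"};
        double[] scores = score(goldHeads, goldLabels, predictedHeads, predictedLabels);
        System.out.println("UAS = " + scores[0] + ", LAS = " + scores[1]);
    }
}
```

In this example all four heads are correct but one label is wrong, so UAS is 100% and LAS is 75%.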