Vietnamese Named Entity Recognition
This is the first public release of a Vietnamese named entity recognition system, developed by Phuong LE-HONG at College of Sciences, Vietnam National University in Hanoi.
This tool implements a conditional Markov model (CMM), a common probabilistic model for sequence labelling. In
essence, CMM is a discriminative model which models the conditional probability distribution P(tag sequence|word sequence)
.
This probability is decomposed into a chain of local probability distributions P(tag|word)
by using the Markov property.
Each local probability distribution is a log-linear model (also called a maximum entropy model).
Actually, the implemented approaches combines both a forward sequence model and a backward sequence model and token regular expressions to achieve the first-rank result of the VLSP 2016 NER shared task, organized by the Vietnamese Speech and Language Processing Society. On the standard test set, its F1 score is about 88.8%.
The detailed approach is described in the following paper:
- Vietnamese Named Entity Recognition using Token Regular Expressions and Bidirectional Inference, Phuong Le-Hong, Proceedings of Vietnamese Speech and Language Processing (VLSP), Hanoi, Vietnam, 2016.
This tool is implemented in the Scala programming language. It utilizes Apache Spark as its core platform. Apache Spark is a fast and general engine for large scale data processing. Therefore, Vitk.NER is a fast cluster computing toolkit.
Setup and Compilation
-
Prerequisites: A Java Development Kit (JDK), version 8.0 or later JDK. Apache Maven version 3.0 or later Maven. Make sure that two following commands work perfectly in your shell (console window).
`java -version` `mvn -version`
-
Download a prebuilt version of Apache Spark. Vitk uses Spark version 2.2.0. Unpack the compressed file to a directory, for example
~/spark
where~
is your home directory. -
Download Vitk.NER, either a binary archive or its source code. The repository URL of the project is Vitk. The source code version is preferable. It is easy to compile and package Vitk: go to the top-level directory of Vitk and invoke the following command at a shell window:
`mvn compile package`
Apache Maven will automatically resolve and download dependency libraries required by Vitk. Once the process finishes, you should have a binary jar file
ai.vitk.ner-1.0.jar
in the sub-directorytarget
.
Running
Data Files
Pre-trained model files used by Vitk.NER are included in the resources directory (src/main/resources/
). For ease of use,
these data files are packaged in the JAR file.
These folders can contain data specific to a natural language in
use. Each language is specified further by a sub-directory whose name
is an abbreviation of the language name, for example vi
for
Vietnamese, en
for English, ja
for Japanese, etc.
Vitk can run as an application on a stand-alone cluster mode or on a real cluster. If it is run on a cluster, it is required that all machines in the cluster are able to access the same data files, which are normally located in a shared directory readable by all the machines, or a Hadoop file system.
Command Line Arguments
Vitk.NER is an Apache Spark application, you run it by submitting the
main JAR file ai.vitk.ner-1.0.jar
to Apache Spark. The main class of the
toolkit is ai.vitk.ner.Tagger
which selects the desired tool by following
arguments provided by the user.
The arguments of Vitk.NER are as follows:
-
--master <master-url>
: the master URL for the cluster, for examplespark://192.168.1.1:7077
. If you do not have a cluster, you can ignore this parameter. In this case, Vitk uses the stand-alone cluster mode, which is defined bylocal[*]
, that is, it uses all the CPU cores of your single machine for parallel processing. -
--language <language>
: the natural language to process, wherelanguage
is an abbreviation of language name which is eithervi
(Vietnamese) oren
(English). If this argument is not specified, the default language is Vietnamese. -
--verbose
: this parameter does not require argument. If it is used, Vitk runs in verbose mode, in which some intermediate information will be printed out during the processing. This is useful for debugging. -
--mode <mode>
: the running mode, eithertag
,train
, oreval
, the default mode istag
. -
--input <input-file>
: the name of an input file to be used. This should be a text file in UTF-8 encoding. If the tagger is in thetag
mode, it will read and tag every lines of this file. If it is in theeval
ortrain
mode, it will preprocess the file to get pairs of word and tag sequences for use in evaluating or training. -
--dimension <dimension>
: this argument is only required in thetrain
mode to specify the number of features (or the domain dimension) of the resulting Conditional Markov Model. The dimension is a positive integer and depends on the size of the data. Normally, the larger the training data is, the greater the dimension that should be used. Vitk implements the feature hashing trick, which is a fast and space-efficient way of vectorizing features. As an example, we set this argument as 32,768 when training a CMM on about 16,000 tagged sentences of the VLSP NER corpus. The default dimension is 32,768. -
--iteration <n>
: this argument is only used in thetrain
mode to specify the number of iterations used in the L-BFGS optimization algorithm. The default is 600. -
--reversed
: this parameter does not require argument. If it is used, the tool trains or tests using reversed sentences to produce a backward sequence model instead of the default forward sequence model. -
--memory <mem>
: specify the memory used in Spark executors, for example4g
,8g
. If you are training a large corpus, you should consider setting an appropriate value for this parameter for Spark efficiency and avoid out of memory errors.
Running
Suppose that Apache Spark has been installed in ~/spark
, Vitk.NER has
been installed in ~/ai.vitk.ner
. To launch Vitk.NER, open a console, enter the
folder ~/ai.vitk.ner
and invoke an appropriate command. For example:
To tag an input file and write the result to an output file of the same name (with generated suffix .out
), using
the default pre-trained model:
./spark/bin/spark-submit target/ai.vitk.ner-1.0.jar --mode tag --input <input-file>
Because the default mode of the tagger is tag
, we can therefore drop the argument
--mode tag
in the command above.
There is not any --master
argument in the command above, therefore, Vitk.NER
runs in the stand-alone cluster mode which uses a single local machine and all CPU cores.
- To evaluate the accuracy on a gold corpus in the resource path
/ner/vi/test.txt
, using the default pre-trained CMM:
./spark/bin/spark-submit target/ai.vitk.ner-1.0.jar --mode eval
- To train a forward model tagger on a gold corpus in the resource path
/ner/vi/train.txt
:
./spark/bin/spark-submit target/ai.vitk.ner-1.0.jar --mode train --dimension 4096
The resulting model will be saved in the resource directory src/main/resources/ner/vi/mlr
.
- To train a backward model tagger on a gold corpus in the resource path
/ner/vi/train.txt
:
./spark/bin/spark-submit target/ai.vitk.ner-1.0.jar --mode train --reversed
The resulting model will be saved in the resource directory src/main/resources/ner/vi/mlr-reversed
.
It has 32,768 dimensions (the default number of dimensions) and is saved to the
default directory src/main/resources/ner/vi/mlr-reversed/
.
If you wish to run Vitk on a Spark cluster, all you need to do is to specify the master URL of the cluster, such as:
./spark/bin/spark-submit target/ai.vitk.ner-1.0.jar --mode train --reversed --master spark://192.168.1.1:7077
References
If you use Vitk.NER in your research, please credit our work by citing the following publication:
- Vietnamese Named Entity Recognition using Token Regular Expressions and Bidirectional Inference, Phuong Le-Hong, Proceedings of Vietnamese Speech and Language Processing (VLSP), Hanoi, Vietnam, 2016.
Contact
Any bug reports, suggestions and collaborations are welcome. I am reachable at:
- LE-HONG Phuong
- College of Science, Vietnam National University in Hanoi