crf4j: CRF model training and testing for Java

This is a pure Java port of taku's crfpp (also known as CRF++), based on the crfpp-0.58 source code.

Credit to komiya for his Java double-array trie implementation.

Features

  • pure Java with minimal dependencies (commons-cli is the only runtime dependency)
  • command-line options and template/input formats compatible with crfpp
  • models can be loaded from the classpath
  • text model format compatible with crfpp
  • conversion between the text model and crf4j's own binary model, in both directions
  • multi-threading support
  • support for the CRF-L1, CRF-L2, and MIRA algorithms
  • n-best outputs
  • CRF model wrapper for API calls
  • tests and a demo that illustrate usage

Usage

Building

mvn clean package

Run tests:

mvn test

Training

java -cp crf4j-<version>-jar-with-dependencies.jar com.github.zhifac.crf4j.CrfLearn <template file> <train datafile> <model path>

For more options, run:

java -cp crf4j-<version>-jar-with-dependencies.jar com.github.zhifac.crf4j.CrfLearn -h

For details on the format of the template file and the training file, refer to the original crfpp page.
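As a rough illustration of those formats, the sketch below writes a tiny template file and training file in the layout crfpp documents: one token per line with whitespace-separated columns, the label in the last column, and a blank line between sentences. The class name CrfSampleFiles, the file names, and the sample sentences are made up for this example and are not part of crf4j.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of crf4j: writes sample input files.
final class CrfSampleFiles {
    // U = unigram feature, B = bigram feature.
    // %x[row,col] picks the token at a relative row and an absolute column.
    static final String TEMPLATE = String.join("\n",
        "# Unigram",
        "U00:%x[-1,0]",
        "U01:%x[0,0]",
        "U02:%x[1,0]",
        "U03:%x[0,1]",
        "",
        "# Bigram",
        "B",
        "");

    // Columns: word, part-of-speech tag, label (the last column is the answer).
    // The empty line separates the two sentences.
    static final String TRAIN_DATA = String.join("\n",
        "He PRP B-NP",
        "reckons VBZ B-VP",
        ". . O",
        "",
        "Prices NNS B-NP",
        "rose VBD B-VP",
        ". . O",
        "");

    // Writes "template" and "train.data" into the given directory.
    static void writeSamples(Path dir) throws IOException {
        Files.write(dir.resolve("template"), TEMPLATE.getBytes(StandardCharsets.UTF_8));
        Files.write(dir.resolve("train.data"), TRAIN_DATA.getBytes(StandardCharsets.UTF_8));
    }
}
```

The two files can then be passed as <template file> and <train datafile> to CrfLearn. A data set this small is only meant to show the layout, not to produce a useful model.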

Testing

To print the output to the console:

java -cp crf4j-<version>-jar-with-dependencies.jar com.github.zhifac.crf4j.CrfTest -m <model path> <test datafile>

To write the output to a file:

java -cp crf4j-<version>-jar-with-dependencies.jar com.github.zhifac.crf4j.CrfTest -m <model path> <test datafile> -o <outputfile>
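The training and testing commands above can also be launched from another Java program. The following is a minimal sketch that only assembles the same command lines and runs them with ProcessBuilder; it does not use crf4j's in-process API. The class CrfCommands and its method names are hypothetical, and the jar path is whatever your build produced.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of crf4j: builds the documented command lines.
final class CrfCommands {
    static final String LEARN_MAIN = "com.github.zhifac.crf4j.CrfLearn";
    static final String TEST_MAIN = "com.github.zhifac.crf4j.CrfTest";

    // java -cp <jar> CrfLearn <template file> <train datafile> <model path>
    static List<String> learn(String jar, String template, String trainData, String model) {
        return new ArrayList<>(Arrays.asList(
            "java", "-cp", jar, LEARN_MAIN, template, trainData, model));
    }

    // java -cp <jar> CrfTest -m <model path> <test datafile> [-o <outputfile>]
    // Pass null as outputFile to print the result to the console.
    static List<String> test(String jar, String model, String testData, String outputFile) {
        List<String> cmd = new ArrayList<>(Arrays.asList(
            "java", "-cp", jar, TEST_MAIN, "-m", model, testData));
        if (outputFile != null) {
            cmd.add("-o");
            cmd.add(outputFile);
        }
        return cmd;
    }

    // Runs the command, forwards its output to this process, returns the exit code.
    static int run(List<String> cmd) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(cmd).inheritIO().start();
        return process.waitFor();
    }
}
```

For example, CrfCommands.run(CrfCommands.test(jar, "model", "test.data", null)) would print the tagging result to the console, assuming the jar and the files exist at those paths.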

API call

Refer to CrfDemo.java.

Performance

Concurrent Access

In an example that uses a crf4j model for named-entity recognition, we used JMeter to test 400 concurrent requests against the same HTTP interface. Here is the result:

#Samples  Average (ms)  Median (ms)  90% Line (ms)  Min (ms)  Max (ms)  Throughput
4000      41            4            60             0         746       1250/sec

The test environment is:

OS             CPU                            MEM
Windows 7 x64  Intel Core i5-4200U @ 1.60GHz  8GB

Notes

The binary model generated by CrfLearn is incompatible with crfpp, but the text model is compatible. To reuse a crfpp model with crf4j, generate a text model when training with crfpp (add the -t option), then run java -cp crf4j.jar com.github.zhifac.crf4j.EncoderFeatureIndex <crfpp_text_model> <output_crf4j_binarymodel> to convert the crfpp text model to a crf4j binary model. If you cannot retrain to produce the text model (e.g. the training data is missing), you can still convert an existing crfpp binary model to a text model with the modified version of crfpp from here.

TODO

  • Optimize memory usage during training (it currently consumes about 8GB of heap memory for 24224128 features, whereas crfpp uses 2GB)

License

LGPL & Modified BSD


Chinese version:

crf4j: a Java implementation of crfpp (crf++)

(based on crfpp 0.58)