Stanford NLP Implementation of the CLAVIN LocationTagger

License: GPL v3

CLAVIN-NERD


CLAVIN-NERD is a GPL-licensed "wrapper project" that connects the Apache-licensed CLAVIN geoparser with the GPL-licensed Stanford CoreNLP NER entity extractor.

Using CLAVIN with Stanford NER (i.e., the CLAVIN-NERD distribution) yields significantly higher accuracy than using the default Apache OpenNLP NameFinder entity extractor. We recommend using CLAVIN-NERD or Novetta's AdaptNLP over OpenNLP. Stanford NER is not included in the standard CLAVIN release because Stanford NER is GPL-licensed and we are committed to distributing CLAVIN itself under the Apache License. The GPL-licensed CLAVIN-NERD distribution therefore makes CLAVIN available for use with Stanford NER while preserving the freedom of the core CLAVIN source code under the terms of the Apache License.

Novetta also maintains the CLAVIN-Rest project, which provides a RESTful microservice wrapper around CLAVIN or CLAVIN-NERD. To use CLAVIN-NERD with CLAVIN-Rest, you only need to edit the CLAVIN-Rest POM. CLAVIN-Rest is configured (and provides instructions) to build and run this package as a Docker image.

Breaking changes

This release includes a breaking change: all namespaces have been renamed from com.bericotech to com.novetta. The rename reflects a change in corporate ownership and a realignment to our new domain. Code that imports classes from the old namespace must be updated to the new one.
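As a rough illustration of what the rename means for calling code, the sketch below rewrites the old package prefix to the new one in a piece of source text. The class name NamespaceMigration and the method migrate are hypothetical helpers written for this example and are not part of CLAVIN or CLAVIN-NERD. It assumes that only the com.bericotech prefix changed and that the rest of each qualified name stayed the same.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of CLAVIN: rewrites the old namespace prefix.
class NamespaceMigration {
    // Match the old prefix only where it starts a qualified name, so that
    // unrelated identifiers such as "xcom.bericotech." are left untouched.
    private static final Pattern OLD_PREFIX =
            Pattern.compile("(?<![\\w.])com\\.bericotech\\.");

    /** Replaces every com.bericotech.* reference with com.novetta.*. */
    static String migrate(String source) {
        return OLD_PREFIX.matcher(source).replaceAll("com.novetta.");
    }

    public static void main(String[] args) {
        // Assumed old-style import, derived from the class named in the build steps.
        String before = "import com.bericotech.clavin.index.IndexDirectoryBuilder;";
        System.out.println(migrate(before));
    }
}
```

In practice a project-wide search and replace in an IDE does the same job; the point is only that the change is a mechanical prefix swap.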

How to build and use CLAVIN-NERD:

CLAVIN-NERD relies on CLAVIN to build its Lucene index. You may want to read the instructions for getting started with CLAVIN before working with CLAVIN-NERD. To build the index using CLAVIN-NERD:

  1. Check out a copy of the source code:
     git clone https://github.com/Novetta/CLAVIN-NERD.git
  2. Move into the newly-created CLAVIN-NERD directory:
     cd CLAVIN-NERD
  3. Download the latest version of the allCountries.zip gazetteer file from GeoNames.org:
     curl -O http://download.geonames.org/export/dump/allCountries.zip
  4. Unzip the GeoNames gazetteer file:
     unzip allCountries.zip
  5. Package the source code:
     mvn clean package
  6. Create the Lucene index (this one-time process will take several minutes):
     MAVEN_OPTS="-Xmx4g" mvn exec:java -Dexec.mainClass="com.novetta.clavin.index.IndexDirectoryBuilder"
  7. Run the example program:

Once you've used CLAVIN to build the required Lucene index with the GeoNames.org gazetteer, consult WorkflowDemoNERD.java for multiple examples of different ways to use CLAVIN-NERD. You can run the CLAVIN-NERD demo from the command line with the following command:

MAVEN_OPTS="-Xmx2g" mvn exec:java -Dexec.mainClass="com.novetta.clavin.nerd.WorkflowDemoNERD"

The main difference between using CLAVIN and CLAVIN-NERD is in the arguments passed to the GeoParserFactory class to instantiate a GeoParser object. With CLAVIN-NERD, we need to specify that we want to use the StanfordExtractor to extract location names from text.

Here's an example call to GeoParserFactory where we specify that the StanfordExtractor should be used, as seen in the WorkflowDemoNERD class:

GeoParserFactory.getDefault("./IndexDirectory", new StanfordExtractor(), 1, 1, false);
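Because the factory call above points at ./IndexDirectory, it can help to confirm that the index was actually built before constructing a GeoParser. The following is a minimal sketch using only the Java standard library. The class IndexCheck and the method looksLikeIndex are hypothetical names for this example, and the check (a non-empty directory) is an assumption about what a usable index looks like, not a validation of the Lucene files themselves.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical pre-flight check, not part of CLAVIN-NERD.
class IndexCheck {
    /**
     * Returns true if the path is a readable directory with at least one entry.
     * A missing, empty, or unreadable directory is treated as "no index".
     */
    static boolean looksLikeIndex(Path dir) {
        if (!Files.isDirectory(dir)) {
            return false;
        }
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.findAny().isPresent();
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Default to the path used in the GeoParserFactory example.
        Path indexDir = Paths.get(args.length > 0 ? args[0] : "./IndexDirectory");
        if (!looksLikeIndex(indexDir)) {
            System.err.println("No index found at " + indexDir.toAbsolutePath()
                    + "; run IndexDirectoryBuilder first.");
            System.exit(1);
        }
        System.out.println("Index directory present: " + indexDir);
    }
}
```

A check like this only catches the most common mistake, running the demo before the index-building step.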

Don't forget: loading the worldwide gazetteer uses a non-trivial amount of memory. If you encounter Java heap space errors when using CLAVIN-NERD in your own programs, increase the maximum heap size for your JVM. Allocating 2 GB (e.g., -Xmx2g) is a good place to start.
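One way to catch an undersized heap early is to have your program report the JVM's maximum heap at startup. The sketch below uses Runtime.getRuntime().maxMemory() from the standard library. The class HeapCheck, its methods, and the warning threshold are assumptions made for illustration: the threshold is set a little under 2 GB because maxMemory() can report slightly less than the -Xmx value, depending on the garbage collector.

```java
// Hypothetical startup check, not part of CLAVIN-NERD.
class HeapCheck {
    // Assumed warning threshold: a little under the 2 GB starting point,
    // since maxMemory() may report slightly less than the -Xmx setting.
    static final long WARN_BELOW_BYTES = 1800L * 1024 * 1024;

    /** Returns true if the maximum heap is smaller than the threshold. */
    static boolean isBelow(long maxHeapBytes, long thresholdBytes) {
        return maxHeapBytes < thresholdBytes;
    }

    /** Converts a byte count to whole megabytes (1 MB = 1024 * 1024 bytes). */
    static long toMegabytes(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        long maxHeap = Runtime.getRuntime().maxMemory();
        System.out.println("Max heap: " + toMegabytes(maxHeap) + " MB");
        if (isBelow(maxHeap, WARN_BELOW_BYTES)) {
            System.err.println("Heap may be too small for the worldwide gazetteer;"
                    + " consider starting the JVM with -Xmx2g or more.");
        }
    }
}
```

The warning is advisory only; the actual memory needed depends on the gazetteer and on how your program uses the parser.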

Get it from Maven Central:

<dependency>
    <groupId>com.novetta</groupId>
    <artifactId>CLAVIN-nerd</artifactId>
    <version>3.0.0</version>
</dependency>

License:

Since the Stanford CoreNLP NER library is licensed via the GPL, CLAVIN-NERD is as well. However, CLAVIN itself remains under the Apache License, version 2.


CLAVIN-NERD Copyright (C) 2012-2020 Novetta

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.