/CoreNLP-jMWE

Stanford CoreNLP annotator implementing jMWE for detecting Multi-Word Expressions / collocations

Primary LanguageJavaGNU General Public License v3.0GPL-3.0

Table of Contents

CoreNLP-jMWE
Installation and use
Demo
Annotator properties
MWE detectors
License
Author information
Original Stanford CoreNLP README

CoreNLP-jMWE

This Stanford CoreNLP annotator is able to capture Multi-Word Expressions (MWE) / collocations from plain text. Under the hood, it integrates jMWE, which itself employs a database generated through processing WordNet.

For example, given the input sentence

She traveled to Las Vegas and looked up the world record.

the detected MWEs information could include:

1. travel_to_V={traveled_VBD,to_TO},    token.isInflected(): true,  token.getForm(): traveled_to
2. las_vegas_N={Las_NNP,Vegas_NNP},     token.isInflected(): false, token.getForm(): las_vegas
3. las_vegas_P={Las_NNP,Vegas_NNP},     token.isInflected(): false, token.getForm(): las_vegas
4. look_up_V={looked_VBD,up_RP},        token.isInflected(): true,  token.getForm(): looked_up
5. world_record_N={world_NN,record_NN}, token.isInflected(): false, token.getForm(): world_record

Detecting MWEs can be useful in a number of applications, such as:

  1. Extracting a subset of n-grams (collocations) from a corpus for topic modeling and information retrieval purposes.
  2. Word Sense Disambiguation problems, see the publications on jMWE for more information.

This is a ready to use new annotator for Stanford CoreNLP, and can therefore be easily integrated into any further downstream NLP processing the same way as any other already existing CoreNLP annotator.

Installation and use

Use without compilation

For convenience, jar files are provided in this repository. To use CoreNLP-jMWE in a project that has Stanford CoreNLP, simply:

  1. Add stanford-corenlp-jmwe-annotator.jar to your projects CLASSPATH
  2. Add either edu.mit.jmwe_1.0.2.jar (binary) or edu.mit.jmwe_1.0.2_jdk.jar (binary and src) from jMWE to your projects CLASSPATH
  3. Download the standard MWE Index data file mweindex_wordnet3.0_semcor1.6.data from jMWE
  4. Setup your CoreNLP Properties to use CoreNLP-jMWE, see the Demo section for a complete example

Done!

Compilation preparation

  1. Clone the project:

     git clone https://github.com/toliwa/CoreNLP-jMWE
    
  2. Create the lib directory:

     mkdir CoreNLP-jMWE/lib
    

Getting Stanford CoreNLP

Either get the stable release (it comes with the model jar):

  1. Download Stanford CoreNLP from http://stanfordnlp.github.io/CoreNLP/index.html and copy all its jar files to CoreNLP-jMWE/lib

Or get the current master branch:

  1. Clone Stanford CoreNLP from https://github.com/stanfordnlp/CoreNLP and copy all its jar files to CoreNLP-jMWE/lib
  2. Get the CoreNLP models from http://nlp.stanford.edu/software/stanford-corenlp-models-current.jar and copy them to CoreNLP-jMWE/lib

Getting jMWE

From jMWE, download:

  1. Either edu.mit.jmwe_1.0.2.jar (binary) or edu.mit.jmwe_1.0.2_jdk.jar (binary and src)

  2. The Standard MWE Index data file mweindex_wordnet3.0_semcor1.6.data

and copy them to CoreNLP/lib.

The compilation preparation is done!

Getting JUnit

From JUnit, download:

  1. hamcrest-core-1.3.jar and junit-4.12.jar and copy them to CoreNLP-jMWE/lib

The compilation preparation is done!

Compiling, jar packaging and unit tests

  1. To compile the project and create jar files, go inside the CoreNLP-jMWE folder and run

     ant jar
    
  2. To run the unit tests, you need to edit the JMWEAnnotatorTest.java and point the public static final String index variable to the location on your file system where the mweindex_wordnet3.0_semcor1.6.data file is stored. Then, run:

     ant clean && ant junit
    

Running the unit tests should, among other tests, output something similar to

    [junit] Running edu.stanford.nlp.pipeline.JMWEAnnotatorTest
    ...
    [junit] Tests run: 12, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 4.337 sec

to indicate that all tests have passed.

Demo

The class JMWEAnnotatorDemo in the package edu.stanford.nlp.pipeline.demo shows a basic use case of the new annotator, given a String it will detect and print out MWE information.

If you have downloaded the jar files from this repository or have created them (see above), you can also run the runJMWEDemo.sh shell script, it can call the JMWEAnnotatorDemo with a predefined text if no further input is given, or use the commandline first argument as input. To get the output as described in the first section, type:

./runJMWEDemo.sh "She traveled to Las Vegas and looked up the world record."

The following method from JMWEAnnotatorDemo shows how to programmatically access detected MWEs. The way to do so mirrors accessing other annotators such as TokensAnnotation as described for the Stanford CoreNLP demo.

    /**
     * jMWE Demo, prints out discovered MWEs from the text
     * @param index the index
     * @param text the text
     */
    public static void jmweDemo(String index, String text) {
        // creates the properties for Stanford CoreNLP: tokenize, ssplit, pos, lemma, jmwe
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize, ssplit, pos, lemma, jmwe");
        props.setProperty("customAnnotatorClass.jmwe", "edu.stanford.nlp.pipeline.JMWEAnnotator");
        props.setProperty("customAnnotatorClass.jmwe.verbose", "false");
        props.setProperty("customAnnotatorClass.jmwe.underscoreReplacement", "-");
        props.setProperty("customAnnotatorClass.jmwe.indexData", index);
        props.setProperty("customAnnotatorClass.jmwe.detector", "CompositeConsecutiveProperNouns");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        
        // put the text in the document annotation 
        Annotation doc = new Annotation(text);
        
        // run the CoreNLP pipeline on the document
        pipeline.annotate(doc);
        
        // loop over the sentences
        List<CoreMap> sentences = doc.get(SentencesAnnotation.class);
        System.out.println();
        for(CoreMap sentence: sentences) {
            System.out.println("Sentence: "+sentence);
            // loop over all discovered jMWE token and perform some action
          for (IMWE<IToken> token: sentence.get(JMWEAnnotation.class)) {
              System.out.println("IMWE<IToken>: "+token+", token.isInflected(): "+token.isInflected()+", token.getForm(): "+token.getForm());
            }
          System.out.println();
        }        
    }

Annotator properties

As can be seen in the example above, the JMWEAnnotator has several properties to be set:

props.setProperty("annotators", "tokenize, ssplit, pos, lemma, jmwe");

"jmwe" is the reference name of the annotator, it has to be called after lemmatization.

props.setProperty("customAnnotatorClass.jmwe", "edu.stanford.nlp.pipeline.JMWEAnnotator")

The full path to its class needs to be set.

props.setProperty("customAnnotatorClass.jmwe.verbose", "false");

By default, verbose is false. If set to true, Annotator information and MWEs will be output during the detection stage.

props.setProperty("customAnnotatorClass.jmwe.underscoreReplacement", "-");

jMWE 1.0.2 throws an IllegalArgumentException when given a "_" symbol, so underscores have to be replaced during detection. A possible choice would be the hyphen symbol "-".

props.setProperty("customAnnotatorClass.jmwe.indexData", index);

indexData needs a path to the mweindex_wordnet3.0_semcor1.6.data file.

props.setProperty("customAnnotatorClass.jmwe.detector", "CompositeConsecutiveProperNouns");

The type of the MWE detector needs to be defined, currently implemented are "Consecutive", "Exhaustive", "ProperNouns", "Complex" and "CompositeConsecutiveProperNouns".

MWE detectors

jMWE comes with a straight-forward way to write and combine detectors, filters and resolvers for MWE discovery. A detailed introduction about this is given in the jMWE user manual.

The class JMWEAnnotator contains the following method, where the MWE detectors are setup. Adding more detectors is only a matter of adding another switch case and instantiating the desired setup as introduced in the jMWE user manual.

The choice of the right MWE detector will greatly impact the retrieved MWEs, on the jMWE page there are jMWE publications, wich provide precision, recall and f-measures for different implemented detectors.

    /**
     * Get the detector.
     * 
     * @param index
     *            the index
     * @param detector 
     *            the detector, \"Consecutive\", \"Exhaustive\", \"ProperNouns\", \"Complex\" or \"CompositeConsecutiveProperNouns\" are supported
     * @return the detector
     */
    public IMWEDetector getDetector(IMWEIndex index, String detector) {
        IMWEDetector iMWEdetector = null;
        switch (detector) {
        case "Consecutive":
            iMWEdetector = new Consecutive(index);
            break;
        case "Exhaustive":
            iMWEdetector = new Exhaustive(index);
            break;
        case "ProperNouns":
            iMWEdetector = ProperNouns.getInstance();
            break;
        case "Complex":
            iMWEdetector = new CompositeDetector(ProperNouns.getInstance(),
                    new MoreFrequentAsMWE(new InflectionPattern(new Consecutive(index))));
            break;
        case "CompositeConsecutiveProperNouns":
            iMWEdetector = new CompositeDetector(new Consecutive(index), ProperNouns.getInstance());            
            break;
        default:
            throw new IllegalArgumentException("Invalid detector argument " + detector
                    + ", only \"Consecutive\", \"Exhaustive\", \"ProperNouns\", \"Complex\" or \"CompositeConsecutiveProperNouns\" are supported.");
        }
        return iMWEdetector;
    }

License

This annotator has the same license as Stanford CoreNLP, which states on https://github.com/stanfordnlp/CoreNLP : "The Stanford CoreNLP code is written in Java and licensed under the GNU General Public License (v3 or later)."

The jMWE files (not included here, available from http://projects.csail.mit.edu/jmwe/ ) have the following note on their page: "The software is distributed under the Creative Commons Attribution 4.0 International License that makes it free to use for any purpose, as long as proper copyright acknowledgement is made."

Name, copyright and other acknowledgements for "jMWE" under the http://creativecommons.org/licenses/by/4.0/ from the page and author of Finlayson, M.A. at http://projects.csail.mit.edu/jmwe/ :

Finlayson, M.A. and Kulkarni, N. (2011) Detecting Multi-Word Expressions Improves Word Sense Disambiguation, Proceedings of the 8th Workshop on Multiword Expressions, Portland, OR. pp. 20-24.

Kulkarni, N. and Finlayson, M.A. jMWE: A Java Toolkit for Detecting Multi-Word Expressions, Proceedings of the 8th Workshop on Multiword Expressions, Portland, OR. pp. 122-124.

Hereby proper copyright acknowledgement is made as required.

The GNU Project states that GPL and the Creative Commons Attribution 4.0 license are compatible: http://www.gnu.org/licenses/license-list.en.html#OtherLicenses

Author information

The CoreNLP-jMWE annotator is developed by Tomasz Oliwa at the Center for Research Informatics (CRI), University of Chicago.

Original Stanford CoreNLP README

Original Stanford CoreNLP README.md from https://github.com/stanfordnlp/CoreNLP :

Stanford CoreNLP

Stanford CoreNLP provides a set of natural language analysis tools written in Java. It can take raw human language text input and give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize and interpret dates, times, and numeric quantities, mark up the structure of sentences in terms of phrases or word dependencies, and indicate which noun phrases refer to the same entities. It was originally developed for English, but now also provides varying levels of support for (Modern Standard) Arabic, (mainland) Chinese, French, German, and Spanish. Stanford CoreNLP is an integrated framework, which make it very easy to apply a bunch of language analysis tools to a piece of text. Starting from plain text, you can run all the tools with just two lines of code. Its analyses provide the foundational building blocks for higher-level and domain-specific text understanding applications. Stanford CoreNLP is a set of stable and well-tested natural language processing tools, widely used by various groups in academia, industry, and government. The tools variously use rule-based, probabilistic machine learning, and deep learning components.

The Stanford CoreNLP code is written in Java and licensed under the GNU General Public License (v3 or later). Note that this is the full GPL, which allows many free uses, but not its use in proprietary software that you distribute to others.

You can find releases of Stanford CoreNLP on Maven Central.

You can find more explanation and documentation on the Stanford CoreNLP homepage.

The most recent models associated with the code in the HEAD of this repository can be found here.

For information about making contributions to Stanford CoreNLP, see the file CONTRIBUTING.md.

Questions about CoreNLP can either be posted on StackOverflow with the tag stanford-nlp, or on the mailing lists.