/ixa-pipe-convert

It converts corpora formats

Primary LanguageJavaApache License 2.0Apache-2.0

ixa-pipe-convert

Build Status GitHub license

This module implements several scripts to do corpus format conversions usually required to train or to evaluate IXA pipes models (http://ixa2.si.ehu.es/ixa-pipes).

TABLE OF CONTENTS

  1. Aspect-Based Sentiment Analysis (ABSA) conversions
  2. Cluster Lexicon conversions
  3. NAF to CoNLL conversions
  4. Markyt Formats conversions
  5. Installation

ABSA

There are several parameters to convert from and to the format used in the SemEval ABSA 2014-2016 shared tasks:

  • absa2015ToWFs: It converts ABSA SemEval 2014-2016 corpus to a tokenized NAF document. This function is used to obtain the evaluation set in the NAF format so that it can be annotated with ixa-pipe-opinion models for testing with the ABSA evaluation script.
  • nafToAbsa2014: The opinion-annotated NAF is converted to ABSA 2014 format for evaluation with the task official evaluation script.
  • nafToAbsa2015: The opinion-annotated NAF is converted to ABSA 2015 and 2016 formats for evaluation with the task official evaluation scripts.
  • absa2014ToCoNLL2002: It converts the ABSA 2014 corpus into CoNLL 2002 format for training ixa-pipe-opinion models.
  • absa2015ToCoNLL2002: It converts the ABSA 2015 and 2016 corpus into CoNLL 2002 format for training ixa-pipe-opinion models.
  • yelpGetText: It extracts the text element from the json formatted YELP dataset.

Clusters

There are several parameters to pre-process cluster lexicons obtained following the methods described in (https://github.com/ragerri/cluster-preprocessing)

NAF

  • nafToCoNLL02: It converts NAF containing named entities layer (entities) into CoNLL 2002 format.
  • nafToCoNLL03: It converts NAF containing named entities layer (entities) into CoNLL 2003 format.

MARKYT

  • barrToWFs: It converts markyt BARR 2017 corpus to a tokenized NAF document. This function is used to obtain the evaluation set in the NAF format so that it can be annotated with ixa-pipe-nerc models for testing.
  • nafToBARR: The entity-annotated NAF is converted to BARR format for evaluation with the task official evaluation scripts.
  • barrToCoNLL2002: It converts the BARR corpus into CoNLL 2002 format for training ixa-pipe-ml sequence models.

INSTALLATION

Install MAVEN

Download MAVEN 3 from

wget http://www.apache.org/dyn/closer.cgi/maven/maven-3/3.0.4/binaries/apache-maven-3.0.4-bin.tar.gz

Now you need to configure the PATH. For Bash Shell:

export MAVEN_HOME=/home/ragerri/local/apache-maven-3.0.4
export PATH=${MAVEN_HOME}/bin:${PATH}

For tcsh shell:

setenv MAVEN3_HOME ~/local/apache-maven-3.0.4
setenv PATH ${MAVEN3}/bin:{PATH}

If you re-login into your shell and run the command

mvn -version

Get module source code

git clone https://github.com/ragerri/ixa-pipe-convert.git
cd ixa-pipe-convert
mvn clean package

Usage

java -jar target/ixa-pipe-convert-$version-exec.jar -help

GENERATING JAVADOC

You can also generate the javadoc of the module by executing:

mvn javadoc:jar

Contact information

Rodrigo Agerri
IXA NLP Group
University of the Basque Country (UPV/EHU)
E-20018 Donostia-San Sebastián
rodrigo.agerri@ehu.eus