Setting up ShangriDocs with DUCC and cTAKES

This guide describes the steps to set up the document exploration tool ShangriDocs with Apache UIMA DUCC and Apache cTAKES.

Prerequisites

Before installing DUCC, create the user ducc and enable passwordless ssh for that user. An example of setting this up (on Red Hat Linux) can be found in DUCC Prerequisites.
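
As a rough sketch of what this prerequisite amounts to on a single machine, the function below generates a key pair without a passphrase and authorizes it for the current account. It assumes OpenSSH is installed and that you are logged in as user ducc; the function name and the choice of an RSA key are illustrative assumptions, not taken from the DUCC documentation. On a cluster, the public key would also need to be appended to authorized_keys on every worker node.

```shell
# Sketch: enable passwordless ssh for the current user (assumed to be "ducc").
# setup_passwordless_ssh is a hypothetical helper; the optional argument is
# the ssh directory and defaults to ~/.ssh.
setup_passwordless_ssh() {
    ssh_dir="${1:-$HOME/.ssh}"
    mkdir -p "$ssh_dir" || return 1
    chmod 700 "$ssh_dir" || return 1
    # Generate a key pair with an empty passphrase unless one already exists.
    if [ ! -f "$ssh_dir/id_rsa" ]; then
        ssh-keygen -q -t rsa -N "" -f "$ssh_dir/id_rsa" || return 1
    fi
    # Authorize the public key for logins to this account (only once).
    touch "$ssh_dir/authorized_keys"
    if ! grep -qxF -e "$(cat "$ssh_dir/id_rsa.pub")" "$ssh_dir/authorized_keys"; then
        cat "$ssh_dir/id_rsa.pub" >> "$ssh_dir/authorized_keys"
    fi
    chmod 600 "$ssh_dir/authorized_keys"
}

# Usage (run as user ducc), then verify with: ssh localhost date
# setup_passwordless_ssh
```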

References

cTAKES Scale Out with UIMA DUCC, Quick Start Tutorial, DUCC Documentation

Download the binary installation file uima-ducc-2.0.1-bin.tar.gz to /home/ducc/

Then, from /home/ducc/

$ tar -xvzf uima-ducc-2.0.1-bin.tar.gz

Once the files are extracted, configure DUCC for your system:

$ cd apache-uima-ducc-2.0.1/admin/

$ ./ducc_post_install

The ducc_post_install script sets up the default configuration in ducc.properties. This is where you can define:

  • Hostname of the DUCC head node
  • Full path to your Java executable (default is /usr/bin/java)

The default configuration file is located at .../apache-uima-ducc-2.0.1/resources/default.ducc.properties

Running ducc_post_install copies the parameters from default.ducc.properties into ducc.properties and adds the hostname of the head node and the Java path. Any changes made directly to ducc.properties are overwritten the next time ducc_post_install runs. To change properties, refer to Modifying ducc properties.

Starting DUCC

From .../apache-uima-ducc-2.0.1/admin

$ ./start_ducc

Note: After starting DUCC, wait at least a minute before submitting any jobs, because DUCC needs time to complete its initialization. A job submitted before initialization has completed returns errors such as type=system error, text=job driver node unavailable.
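
Instead of waiting a fixed minute, a script can poll until the DUCC web server responds. This is only a sketch: wait_until is a hypothetical helper, curl is assumed to be installed, and port 42133 is the web server port used elsewhere in this guide. A responding web server is used here as a rough proxy for readiness; it does not guarantee that the job driver node is already available.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or TIMEOUT_SECONDS have elapsed.
# Returns 0 on success, 1 on timeout. (Hypothetical helper, not part of DUCC.)
wait_until() {
    timeout_s="$1"; interval_s="$2"; shift 2
    waited=0
    until "$@"; do
        if [ "$waited" -ge "$timeout_s" ]; then
            return 1
        fi
        sleep "$interval_s"
        waited=$((waited + interval_s))
    done
    return 0
}

# Usage: poll the DUCC web server for up to 120 seconds, every 5 seconds.
# wait_until 120 5 curl -sf -o /dev/null "http://localhost:42133/system.daemons.jsp"
```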

Checking status

From .../apache-uima-ducc-2.0.1/admin

$ ./check_ducc

The web interface for monitoring the system and jobs can be accessed in a browser at:

  • http://[DUCC hostname]:42133/system.daemons.jsp

  • http://[DUCC hostname]:42133/jobs.jsp

Testing DUCC

Submit a simple example job via command line

$ /home/ducc/apache-uima-ducc-2.0.1/bin/ducc_submit -f /home/ducc/apache-uima-ducc-2.0.1/examples/simple/1.job

Stopping DUCC

$ /home/ducc/apache-uima-ducc-2.0.1/admin/stop_ducc -a

Further configuration of DUCC properties

Modifications to DUCC's configuration should be made in default.ducc.properties. You can edit the parameters in this file manually. Run the ducc_post_install script again after making modifications; the changes take effect after DUCC is restarted.
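
The edit-and-regenerate cycle can be scripted. The sketch below assumes a simple key = value properties file without line continuations; set_prop is a hypothetical helper, and the property name and value in the usage lines are placeholders, so check your own default.ducc.properties for the keys you actually want to change.

```shell
# set_prop FILE KEY VALUE
# Sets KEY = VALUE in a Java-style properties file: replaces any existing
# line for KEY, or appends a new line if KEY is absent. Comment lines are
# left untouched. Note: awk -v interprets backslash escapes in VALUE.
set_prop() {
    file="$1"; key="$2"; value="$3"
    tmp="$file.tmp.$$"
    awk -v k="$key" -v v="$value" '
        BEGIN { found = 0 }
        {
            line = $0
            # Split at the first "=" and compare the trimmed key literally.
            pos = index(line, "=")
            name = (pos > 0) ? substr(line, 1, pos - 1) : ""
            gsub(/^[ \t]+|[ \t]+$/, "", name)
            if (pos > 0 && name == k) { print k " = " v; found = 1 }
            else                      { print line }
        }
        END { if (!found) print k " = " v }
    ' "$file" > "$tmp" && mv "$tmp" "$file"
}

# Usage (paths follow this guide; key and value are placeholders):
# cd /home/ducc/apache-uima-ducc-2.0.1
# set_prop resources/default.ducc.properties example.property.name example-value
# (cd admin && ./ducc_post_install)              # regenerate ducc.properties
# (cd admin && ./stop_ducc -a && ./start_ducc)   # restart so changes apply
```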

DUCC cancels a job and its related processes when it encounters illegal characters. The issue arises in the CollectionReader, when the Job Driver puts characters into the work item CAS that cannot be XML serialized. The resulting error message includes Running Ducc Containerjava.lang.RuntimeException: JP Http Client Unable to Communicate with JD.

Download the binary installation file apache-ctakes-3.2.2-bin.tar.gz to /home/ducc/

If this site doesn't work, other mirror sites are listed under “Current Download Mirror:” at the bottom of the cTAKES Download Page.

Then, from /home/ducc/

$ tar -xvzf apache-ctakes-3.2.2-bin.tar.gz
$ curl -Lo ctakes-resources-3.2.1.1-bin.zip "http://downloads.sourceforge.net/project/ctakesresources/ctakes-resources-3.2.1.1-bin.zip?r=http%3A%2F%2Fsourceforge.net%2Fprojects%2Fctakesresources%2F%3Fsource%3Dtyp_redirect&ts=1433609725&use_mirror=softlayer-dal"
$ mv ctakes-resources-3.2.1.1-bin.zip /home/ducc/apache-ctakes-3.2.2
$ cd /home/ducc/apache-ctakes-3.2.2
$ unzip ctakes-resources-3.2.1.1-bin.zip

You can also follow the instructions for installing cTAKES in cTAKESParser. Note: This version of ShangriDocs does not require Tika to work with cTAKES as a server, unlike the finished product of cTAKESParser.

Using the analysis engine in ShangriDocs requires a UMLS license.

A UMLS license (username and password) can be requested via: Obtain UMLS license here

Obtaining the license takes about two working days.

Improving performance of cTAKES

Advanced modification of cTAKES to improve performance and customize the annotated categories can be found in Creating New Types. This requires the developer's version of cTAKES (apache-ctakes-3.2.2-src.tar.gz). Please refer to the Developer Install Guide.

ShangriDocs’s main site

For convenience, the code of the current version of ShangriDocs on AWS is at https://github.com/selinachu/DUCC-cTAKES-AWS.git

From /home/ducc/

$ git clone https://github.com/selinachu/DUCC-cTAKES-AWS.git

Add your UMLS username and password to CTAKESConfig.properties, located in /home/ducc/DUCC-cTAKES-AWS/shangridocs/shangridocs-services/src/main/resources/CTAKESContentHandler/config/org/apache/tika/sax/

UMLSUser=[your_username]
UMLSPass=[your_password]

Also, export the UMLS username and password as shell environment variables

export ctakes_umlsuser='your_username'
export ctakes_umlspw='your_password'

The following steps are taken from ShangriDocs Tika Server, with step 3 skipped.

Note: This version of ShangriDocs does not require the ctakes-tika server.

$ cd /home/ducc/DUCC-cTAKES-AWS/shangridocs
$ git clone https://github.com/apache/tika.git
$ cd /home/ducc/DUCC-cTAKES-AWS/shangridocs/tika
$ java -jar tika-server/target/tika-server-1.11-SNAPSHOT.jar > ../tika-server.log 2>&1&

The Tika server listens on port 9998.

Now you are all set up to start ShangriDocs

From /home/ducc/apache-ctakes-3.2.2/desc/

In every descriptor file that contains a <multipleDeploymentAllowed> tag, change its value from false to true.

Note: A simple way to accomplish this is to search all descriptor files under .../apache-ctakes-3.2.2/desc/ for <multipleDeploymentAllowed>false and replace it with <multipleDeploymentAllowed>true
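
The search-and-replace described above could be scripted roughly as follows. The function name is hypothetical, and the sketch assumes the tag and its value appear together with no whitespace in between, exactly as <multipleDeploymentAllowed>false; descriptors formatted differently would need a manual edit. A .bak copy of each changed file is kept so the change can be reviewed or reverted.

```shell
# allow_multiple_deployment DESC_DIR
# Rewrites <multipleDeploymentAllowed>false to <multipleDeploymentAllowed>true
# in every .xml file under DESC_DIR that contains it, keeping a .bak copy of
# each changed file. (Hypothetical helper for this guide.)
allow_multiple_deployment() {
    desc_dir="$1"
    find "$desc_dir" -type f -name '*.xml' \
        -exec grep -l '<multipleDeploymentAllowed>false' {} + |
    while IFS= read -r f; do
        cp "$f" "$f.bak"
        sed 's#<multipleDeploymentAllowed>false#<multipleDeploymentAllowed>true#g' \
            "$f.bak" > "$f"
        echo "updated $f"
    done
}

# Usage:
# allow_multiple_deployment /home/ducc/apache-ctakes-3.2.2/desc
```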

Add type system information to FilesInDirectoryCollectionReader.xml in .../apache-ctakes-3.2.2/desc/ctakes-core/desc/collection_reader/

(Or FilesInDirectoryCollectionReader.xml from this repository)

<typeSystemDescription>
  <imports>
    <import name="org.apache.ctakes.typesystem.types.TypeSystem"/>
  </imports>
</typeSystemDescription>
<typePriorities/>
<fsIndexCollection/>
<capabilities>
  <capability>
    <inputs/>
    <outputs>
      <type allAnnotatorFeatures="true">org.apache.ctakes.typesystem.types.TypeSystem</type>
    </outputs>
  </capability>
</capabilities>

Add type system information to XCasWriterCasConsumer.xml in .../apache-ctakes-3.2.2/desc/ctakes-core/desc/cas_consumer (or use XCasWriterCasConsumer.xml from this repository)

<import name="org.apache.ctakes.typesystem.types.TypeSystem"/>

Configurations are mainly defined in the DUCC properties file default.ducc.properties and the job description file ctakes.job

Replace /home/ducc/apache-uima-ducc-2.0.1/resources/default.ducc.properties with this default.ducc.properties and run the ducc_post_install script again.

Set these environment variables

export DUCC_HOME=[path to ducc]
export CTAKES_HOME=[path to ctakes]
export SHANGRIDOCS_HOME=[path to shangridocs]
export TIKA_HOME=[path to tika]

If you followed the setup instructions above, the paths are

export DUCC_HOME="/home/ducc/apache-uima-ducc-2.0.1"
export CTAKES_HOME="/home/ducc/apache-ctakes-3.2.2"
export SHANGRIDOCS_HOME="/home/ducc/DUCC-cTAKES-AWS/shangridocs"
export TIKA_HOME="/home/ducc/DUCC-cTAKES-AWS/shangridocs/tika"
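
Before starting the services, it can help to confirm that these variables are set and point to real directories. The helper below is a hypothetical sketch for this guide, not part of DUCC, cTAKES, or ShangriDocs.

```shell
# check_homes VAR_NAME [VAR_NAME...]
# Verifies that each named variable is set and points to an existing
# directory. Prints one line per problem to stderr and returns non-zero if
# any check fails. (Hypothetical helper for this guide.)
check_homes() {
    rc=0
    for name in "$@"; do
        # Look up the value of the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set" >&2
            rc=1
        elif [ ! -d "$value" ]; then
            echo "$name points to a missing directory: $value" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Usage:
# check_homes DUCC_HOME CTAKES_HOME SHANGRIDOCS_HOME TIKA_HOME
```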

Then start DUCC, the Tika server, and the ShangriDocs web application:

$ cd /home/ducc/apache-uima-ducc-2.0.1/admin
$ ./start_ducc
$ cd /home/ducc/DUCC-cTAKES-AWS/shangridocs/tika
$ ./run.bash
$ cd /home/ducc/DUCC-cTAKES-AWS/shangridocs/shangridocs-webapp
$ mvn clean tomcat7:run &

Note: The run.bash script starts the Tika server. Alternatively, start the Tika server manually with the java -jar command shown earlier.

For a multi-node setup, the nodes need to be listed in the following files: ducc.nodes and jobdriver.nodes under .../apache-uima-ducc-2.0.1/resources/

ducc-1.aws-hostname.com
ducc-2.aws-hostname.com
ducc-3.aws-hostname.com
ducc-4.aws-hostname.com
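
A small helper can write the same host list to both files, which keeps them in sync. This is a sketch under the assumption stated in this guide that both files contain the same list; write_node_files is a hypothetical name, and the hostnames in the usage lines are the placeholders from the listing above.

```shell
# write_node_files RESOURCES_DIR HOST [HOST...]
# Writes the given hostnames, one per line, to ducc.nodes and jobdriver.nodes
# in RESOURCES_DIR. Existing files are overwritten.
write_node_files() {
    resources_dir="$1"; shift
    [ -d "$resources_dir" ] || { echo "no such directory: $resources_dir" >&2; return 1; }
    [ "$#" -gt 0 ] || { echo "no hostnames given" >&2; return 1; }
    for file in ducc.nodes jobdriver.nodes; do
        printf '%s\n' "$@" > "$resources_dir/$file"
    done
}

# Usage:
# write_node_files /home/ducc/apache-uima-ducc-2.0.1/resources \
#     ducc-1.aws-hostname.com ducc-2.aws-hostname.com \
#     ducc-3.aws-hostname.com ducc-4.aws-hostname.com
```

Since DUCC relies on passwordless ssh between nodes, running ssh -o BatchMode=yes against each listed host as user ducc is a quick way to confirm that every node is reachable without a password prompt.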

This sets up DUCC to run on a cluster; more specifically, it connects the head node with the other three worker nodes, which allows the head node to send a job to one of its worker nodes.

DUCC does not automatically break up a large job so that it runs on multiple machines simultaneously. Accomplishing this requires preprocessing of the document(s). The idea is to create a separate set of CASes for each document and send them into the pipeline. This can be done by incorporating a custom Flow Controller into the Aggregate Analysis Engine that contains a CAS Multiplier. Routing a CAS to a CAS Multiplier permits the creation of new CASes. The CAS Multiplier and Analysis Engine threads can then run in parallel.

A related example can be found in the DUCC Documentation. It explains how to split a single text file into separate documents, using paragraphs as boundaries, thereby breaking large files into multiple work items.
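
To illustrate the paragraph-boundary idea outside of UIMA, the sketch below splits a text file into one file per paragraph before the files are handed to the pipeline. This is a file-level preprocessing stand-in, not the CAS Multiplier approach from the DUCC Documentation; the function name and the part-NNNN.txt naming scheme are assumptions made for this example.

```shell
# split_paragraphs INPUT_FILE OUTPUT_DIR
# Writes each blank-line-separated paragraph of INPUT_FILE to its own file
# (part-0001.txt, part-0002.txt, ...) in OUTPUT_DIR and prints the number of
# paragraphs written.
split_paragraphs() {
    input="$1"; out_dir="$2"
    mkdir -p "$out_dir" || return 1
    awk -v dir="$out_dir" '
        BEGIN { RS = "" }            # paragraph mode: records end at blank lines
        {
            file = sprintf("%s/part-%04d.txt", dir, NR)
            print > file             # one paragraph per output file
            close(file)              # avoid running out of open file handles
        }
        END { print NR }
    ' "$input"
}

# Usage:
# split_paragraphs big-document.txt ./split-input
```

Each resulting file can then be treated as a separate document, so a directory-based collection reader sees many small inputs instead of one large one.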

Related documentation: DUCC's Flow Controller and UIMA's Flow Controllers with CAS Multipliers