Analytic UIMA pipelines using Spark

SUIM

SUIM, Spark for Unstructured Information, provides a thin abstraction layer for UIMA on top of Spark. SUIM leverages Spark's resilient distributed datasets (RDDs) to run UIMA pipelines built with uimaFIT; pipelines are distributed across the nodes of a cluster and can be operated on in parallel [1].

SUIM allows you to run analytical pipelines on the resulting (or intermediate) CASes to execute further text analytics or machine learning algorithms.

Examples

Count buildings from the UIMA tutorial

Using the RoomNumberAnnotator, count how many room numbers are found per building:

    // Build an RDD of CASes from a UIMA collection reader
    val typeSystem = TypeSystemDescriptionFactory.createTypeSystemDescription()
    val params = Seq(FileSystemCollectionReader.PARAM_INPUTDIR, "data")
    val rdd = makeRDD(createCollectionReader(classOf[FileSystemCollectionReader], params: _*), sc)
    // Run the RoomNumberAnnotator on every CAS and collect the RoomNumber annotations
    val rnum = createEngineDescription(classOf[RoomNumberAnnotator])
    val rooms = rdd.map(process(_, rnum)).flatMap(scas => JCasUtil.select(scas.jcas, classOf[RoomNumber]))
    // Count room numbers per building and print the result
    val counts = rooms.map(room => room.getBuilding()).map((_,1)).reduceByKey(_ + _)
    counts.foreach(println(_))
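
If you want to keep the counts rather than print them, Spark's standard RDD output methods apply; the output path below is only a placeholder:

    counts.saveAsTextFile("hdfs://localhost:9000/building-counts")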

If the collection is too large to fit in memory, or you already have a collection of SCASes, use an HDFS RDD:

    val rdd = sequenceFile(createCollectionReader(classOf[FileSystemCollectionReader], params: _*),
      "hdfs://localhost:9000/documents", sc)

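The HDFS-backed RDD is a drop-in replacement for the in-memory one, so the rest of the pipeline stays the same; for example, reusing the rnum engine description from above:

    val rooms = rdd.map(process(_, rnum))
      .flatMap(scas => JCasUtil.select(scas.jcas, classOf[RoomNumber]))
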
Tokenize and count words with DKPro Core

Use DKPro Core [2] to tokenize and Spark to do token-level analytics.

    // Read plain-text documents with DKPro Core's TextReader
    val typeSystem = TypeSystemDescriptionFactory.createTypeSystemDescription()
    val rdd = makeRDD(createCollectionReader(classOf[TextReader],
      ResourceCollectionReaderBase.PARAM_SOURCE_LOCATION, "data",
      ResourceCollectionReaderBase.PARAM_LANGUAGE, "en",
      ResourceCollectionReaderBase.PARAM_PATTERNS, Array("[+]*.txt")), sc)
    // Tokenize every CAS with the BreakIteratorSegmenter and collect the Token annotations
    val seg = createPrimitiveDescription(classOf[BreakIteratorSegmenter])
    val tokens = rdd.map(process(_, seg)).flatMap(scas => JCasUtil.select(scas.jcas, classOf[Token]))
    // Count token occurrences and sort by count
    val counts = tokens.map(token => token.getCoveredText())
      .filter(filter(_))
      .map((_,1)).reduceByKey(_ + _)
      .map(pair => (pair._2, pair._1)).sortByKey(true)
    counts.foreach(println(_))
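
The filter helper used above is not part of uimaFIT or DKPro Core and is not shown in this example; a minimal sketch of one possible predicate (the exact criteria are an assumption) that keeps only tokens containing at least one letter or digit:

    // Hypothetical helper: drop punctuation- and whitespace-only tokens
    def filter(text: String): Boolean = text.exists(_.isLetterOrDigit)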

Common Tasks

To build:

    mvn compile

To run:

    mvn scala:run

To test:

    mvn test

To create a standalone jar with dependencies and run it:

    mvn package
    java -jar target/spark-uima-tools-0.0.1-SNAPSHOT-jar-with-dependencies.jar

References