Primary language: Scala. License: MIT.

leafy

The name is temporary, and the project is a proof of concept for now.

A UIMA-like framework, but lightweight, modern, concurrent, and distributed by nature.

Why leafy?

It lets you build simple (but powerful) flows to leverage your unstructured data.

A flow can combine multiple analysis engines and each flow can be branched into another flow.

Branches are run concurrently, making flows efficient and smart.

Branches can also be merged and fed into another flow... Unlimited power!

Oh, and by the way, flows and analysis engines are distributed-ready by design. Nice, right?

Illustration

In Natural Language Processing, you could make something like this:

[Image: a simple leafy flow]

Usage

Bucket

The bucket is how analysis engines communicate.
An engine puts whatever it wants into the bucket, and the engines that follow it in the flow can read that data if they need it.

For now this is what buckets look like:

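import scala.reflect.ClassTag

case class Bucket(source: String, data: List[Data]) {
  // Each method returns a new Bucket; the receiver is never modified.
  def add(d: Data): Bucket = Bucket(source, data :+ d)
  def add(d: List[Data]): Bucket = Bucket(source, data ++ d)
  // Returns only the items of type T.
  def select[T: ClassTag]: List[T] = data.collect { case t: T => t }
  // Combines both data lists without duplicates; the result keeps b's source.
  def merge(b: Bucket): Bucket = Bucket(b.source, (data ++ b.data).distinct)
}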

This is an immutable structure: add and merge return a new bucket instead of modifying the existing one.
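To make the immutability concrete, here is a minimal, self-contained sketch of the bucket operations. The Data trait, the Annotation field names and the Language type are stand-ins assumed for illustration; they are not taken from leafy's sources.

```scala
import scala.reflect.ClassTag

// Stand-ins for leafy's model types (assumed shapes, for illustration only).
trait Data
case class Annotation(begin: Int, end: Int, text: String) extends Data
case class Language(code: String) extends Data // hypothetical data type

// Same shape as the Bucket shown above.
case class Bucket(source: String, data: List[Data]) {
  def add(d: Data): Bucket = Bucket(source, data :+ d)
  def add(d: List[Data]): Bucket = Bucket(source, data ++ d)
  def select[T: ClassTag]: List[T] = data.collect { case t: T => t }
  def merge(b: Bucket): Bucket = Bucket(b.source, (data ++ b.data).distinct)
}

object BucketDemo {
  val empty = Bucket("Hello leafy", Nil)

  // add returns a new bucket; `empty` still holds no data afterwards.
  val withToken = empty.add(Annotation(0, 5, "Hello"))
  val withBoth = withToken.add(Language("en"))

  // select filters the data by type.
  val annotations = withBoth.select[Annotation]

  // merge concatenates both data lists and drops duplicates.
  val merged = withToken.merge(withBoth)
}
```

Because every operation returns a fresh value, a bucket can be shared between concurrent branches without locking.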

Analysis engine

Create your engines by implementing the AnalysisEngine trait and its process method:

import leafy.analysis.AnalysisEngine
import leafy.models.{Annotation, Bucket}

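class WhitespaceTokenizer extends AnalysisEngine {

  def process(b: Bucket): Bucket = {
    // Offset of the current token's first character in the source text
    var cursor = 0
    var tokens = List[Annotation]()

    b.source.split(" ").foreach { x =>
      // One annotation per token: (begin offset, end offset, token text)
      tokens = tokens :+ Annotation(cursor, cursor + x.length, x)
      // Skip the token and the single space that follows it
      cursor = cursor + x.length + 1
    }

    // The returned bucket is what the next engine receives
    b.add(tokens)
  }
}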

Annotations and any other results must be put into the bucket.
The last expression of process must be a bucket, since that is what the method returns.
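The same tokenizer can also be written without mutable variables. The sketch below is one possible variant, not leafy's own code: the Data, Annotation, Bucket and AnalysisEngine definitions are minimal stand-ins with assumed shapes, and FoldingWhitespaceTokenizer is a hypothetical name.

```scala
// Stand-ins for leafy's types (assumed shapes, for illustration only).
trait Data
case class Annotation(begin: Int, end: Int, text: String) extends Data
case class Bucket(source: String, data: List[Data]) {
  def add(d: List[Data]): Bucket = Bucket(source, data ++ d)
}
trait AnalysisEngine { def process(b: Bucket): Bucket }

// Hypothetical variant of the whitespace tokenizer using a fold.
class FoldingWhitespaceTokenizer extends AnalysisEngine {
  def process(b: Bucket): Bucket = {
    // Carry (cursor, tokens so far) through the fold. Tokens are prepended,
    // so the list is reversed once at the end.
    val (_, reversed) =
      b.source.split(" ").foldLeft((0, List.empty[Annotation])) {
        case ((cursor, acc), word) =>
          val token = Annotation(cursor, cursor + word.length, word)
          (cursor + word.length + 1, token :: acc)
      }
    b.add(reversed.reverse)
  }
}

object TokenizerDemo {
  // An engine is a plain class, so it can be exercised without a flow.
  val result: Bucket =
    new FoldingWhitespaceTokenizer().process(Bucket("My source text", Nil))
}
```

Calling process directly like this is a convenient way to unit-test an engine before wiring it into a flow.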

Data type

To create your own data type that can be put in a bucket, extend Data:

import leafy.models.Data

case class NamedEntity(text: String) extends Data
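As a rough illustration of why custom types are useful, the sketch below stores a NamedEntity next to an Annotation and reads each back by type. Data, Annotation and Bucket are minimal stand-ins with assumed shapes, not leafy's actual classes.

```scala
import scala.reflect.ClassTag

// Stand-ins for leafy's model types (assumed shapes, for illustration only).
trait Data
case class Annotation(begin: Int, end: Int, text: String) extends Data
case class NamedEntity(text: String) extends Data

case class Bucket(source: String, data: List[Data]) {
  def add(d: List[Data]): Bucket = Bucket(source, data ++ d)
  def select[T: ClassTag]: List[T] = data.collect { case t: T => t }
}

object DataTypeDemo {
  // Anything that extends Data can live in the same bucket.
  val bucket: Bucket = Bucket("Paris is sunny", Nil)
    .add(List(Annotation(0, 5, "Paris"), NamedEntity("Paris")))

  // Downstream engines pick out only the types they care about.
  val entities: List[NamedEntity] = bucket.select[NamedEntity]
  val annotations: List[Annotation] = bucket.select[Annotation]
}
```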

Flow

Simple flows

You can chain several analysis engines through a flow:

import leafy._
import leafy.flow.Flow

Flow("My source text #data", AE[WhitespaceTokenizer], AE[NamedEntityRecognition])

The first parameter is the text to process.
The remaining parameters are the engines you want to chain.

The call returns a future that completes with a bucket containing all the data added by the engines.
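The idea of a simple flow can be sketched with the standard library alone. The code below is not leafy's implementation (leafy is powered by Akka); it swaps in a plain scala.concurrent.Future, and SimpleFlowSketch.run, Note, WordCount and CharCount are hypothetical names introduced only for this example.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Stand-ins for leafy's types (assumed shapes, for illustration only).
trait Data
case class Note(text: String) extends Data // hypothetical data type
case class Bucket(source: String, data: List[Data]) {
  def add(d: Data): Bucket = Bucket(source, data :+ d)
}
trait AnalysisEngine { def process(b: Bucket): Bucket }

// Two toy engines that each add one item to the bucket.
class WordCount extends AnalysisEngine {
  def process(b: Bucket): Bucket =
    b.add(Note("words=" + b.source.split(" ").length))
}
class CharCount extends AnalysisEngine {
  def process(b: Bucket): Bucket =
    b.add(Note("chars=" + b.source.length))
}

object SimpleFlowSketch {
  // Hypothetical stand-in for Flow(source, engines...): start from an empty
  // bucket and pass it through each engine in order, off the calling thread.
  def run(source: String, engines: AnalysisEngine*): Future[Bucket] =
    Future(engines.foldLeft(Bucket(source, Nil))((bucket, engine) => engine.process(bucket)))

  def demo(): Bucket = {
    val future = run("My source text #data", new WordCount, new CharCount)
    // Blocking is acceptable in a demo; real code would map or flatMap the future.
    Await.result(future, 5.seconds)
  }
}
```

Each engine receives the bucket produced by the previous one, so the final bucket accumulates the output of the whole chain.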

Branch flows

import leafy._
import leafy.flow.Flow

val startFlow = Flow("My source text #data", AE[WhitespaceTokenizer], AE[NamedEntityRecognition])

val branch0 = Flow.branch(startFlow, AE[SomeAE], AE[AnotherAE])
val branch1 = Flow.branch(startFlow, AE[Stuff], AE[OkWhyNot])

[Image: branches]

Each branch will run concurrently.

Merge flows

import leafy._
import leafy.flow.Flow

// ...

val merged = Flow.merge(branch0, branch1, branch2)

// You can branch it into other flows
val continue = Flow.branch(merged, AE[...])

[Image: merged]
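Branching and merging can be approximated with plain futures, which may help when reasoning about what runs concurrently. This is a sketch under stated assumptions, not leafy's implementation: BranchMergeSketch, its branch and merge helpers, the Engine alias and the Tag type are all hypothetical, and engines are modelled as simple Bucket => Bucket functions.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Stand-ins for leafy's types (assumed shapes, for illustration only).
trait Data
case class Tag(name: String) extends Data // hypothetical data type
case class Bucket(source: String, data: List[Data]) {
  def add(d: Data): Bucket = Bucket(source, data :+ d)
  def merge(b: Bucket): Bucket = Bucket(b.source, (data ++ b.data).distinct)
}

object BranchMergeSketch {
  // An engine modelled as a plain function, to keep the sketch short.
  type Engine = Bucket => Bucket
  def tagger(name: String): Engine = _.add(Tag(name))

  // Stand-in for Flow.branch: continue from an upstream result with more engines.
  // Each call registers its own continuation, so branches may run in parallel.
  def branch(upstream: Future[Bucket], engines: Engine*)(implicit ec: ExecutionContext): Future[Bucket] =
    upstream.map(start => engines.foldLeft(start)((bucket, engine) => engine(bucket)))

  // Stand-in for Flow.merge: wait for every branch, then combine with Bucket.merge.
  // Requires at least one branch, since reduce fails on an empty list.
  def merge(branches: Future[Bucket]*)(implicit ec: ExecutionContext): Future[Bucket] =
    Future.sequence(branches.toList).map(_.reduce(_ merge _))

  def demo(): Bucket = {
    import ExecutionContext.Implicits.global
    val start = Future(Bucket("My source text #data", Nil)).map(tagger("start"))
    val branch0 = branch(start, tagger("a"))
    val branch1 = branch(start, tagger("b"))
    Await.result(merge(branch0, branch1), 5.seconds)
  }
}
```

Both branches start from the same immutable bucket, so the data they share ("start") appears only once after the merge thanks to distinct.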

Architecture

// @todo: explain how this beauty works 💄😘

To do

  • Better bucket structure for large processing
  • Simple and sexy resource management (avoid the ugly UIMA way)
  • Maybe switch to akka-stream... not sure yet

Powered by

Scala, Akka <3