The name is temporary, and the project is a proof of concept for now.
A UIMA-like framework, but lightweight, modern, concurrent and distributed by nature.
It lets you build simple (but powerful) flows to leverage your unstructured data.
A flow can combine multiple analysis engines and each flow can be branched into another flow.
Branches are run concurrently, making flows efficient and smart.
Branches can also be merged and go to another flow... Unlimited power!
Oh and by the way, flows and analysis engines are distributed-ready by design. Nice right?
In Natural Language Processing, you could make something like this:
The bucket is how analysis engines communicate with each other.
You put anything you want in it, and the next engines in the flow can read it if they need it.
For now this is what buckets look like:
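    import scala.reflect.ClassTag

    case class Bucket(source: String, data: List[Data]) {
      // Return a new bucket with one more item appended.
      def add(d: Data) = Bucket(source, data :+ d)
      // Return a new bucket with several items appended.
      def add(d: List[Data]) = Bucket(source, data ++ d)
      // Keep only the items of type T.
      def select[T: ClassTag] = data.collect({case t: T => t})
      // Combine two buckets, dropping duplicates; the source of b is kept.
      def merge(b: Bucket) = Bucket(b.source, (data ++ b.data).distinct)
    }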
This is an immutable structure: add and merge return a new bucket and leave the original untouched.
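To make the bucket operations concrete, here is a minimal sketch you can compile on its own. The Data trait, the field names of Annotation (begin, end, text) and the Language type are assumptions made up for the example; only the Bucket methods come from the definition above.

```scala
import scala.reflect.ClassTag

object BucketSketch {
  // Hypothetical stand-ins for leafy.models; the real definitions may differ.
  trait Data
  case class Annotation(begin: Int, end: Int, text: String) extends Data
  case class Language(code: String) extends Data

  case class Bucket(source: String, data: List[Data]) {
    def add(d: Data): Bucket = Bucket(source, data :+ d)
    def add(d: List[Data]): Bucket = Bucket(source, data ++ d)
    def select[T: ClassTag]: List[T] = data.collect { case t: T => t }
    def merge(b: Bucket): Bucket = Bucket(b.source, (data ++ b.data).distinct)
  }

  val empty: Bucket = Bucket("Hello world", Nil)
  // Each add returns a new bucket; `empty` itself never changes.
  val withToken: Bucket = empty.add(Annotation(0, 5, "Hello"))
  val withBoth: Bucket = withToken.add(Language("en"))
  // select filters by type, so only the annotation comes back.
  val tokens: List[Annotation] = withBoth.select[Annotation]
  // merge concatenates both lists and drops the duplicated annotation.
  val merged: Bucket = withToken.merge(withBoth)

  def main(args: Array[String]): Unit = {
    println(empty.data.isEmpty) // true
    println(tokens)
    println(merged.data.size)   // 2
  }
}
```

Because merge relies on distinct, two items only collapse into one when they are equal as case classes.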
Create your engines by implementing the AnalysisEngine trait and its process method:
    import leafy.analysis.AnalysisEngine
    import leafy.models.{Annotation, Bucket}

    class WhitespaceTokenizer extends AnalysisEngine {
      def process(b: Bucket): Bucket = {
        // Offset of the current token in the source text.
        var cursor = 0
        var tokens = List[Annotation]()
        b.source.split(" ").foreach { x =>
          tokens = tokens :+ Annotation(cursor, cursor + x.length, x)
          // Skip the token and the space that follows it.
          cursor = cursor + x.length + 1
        }
        b.add(tokens)
      }
    }
Annotations and other results must be put inside the bucket.
The last expression of process must be a bucket, since it is the return value.
To create your own data type that can be put in a bucket, extend Data:
    import leafy.models.Data

    case class NamedEntity(text: String) extends Data
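As a rough illustration of how a custom type travels through a bucket, the sketch below has an engine read the tokens left by an earlier engine and add NamedEntity results. Everything except NamedEntity and the method names is an assumption: the stand-in Data, Annotation, Bucket and AnalysisEngine definitions exist only so the example compiles without leafy, and CapitalizedWordRecognizer is a deliberately naive, hypothetical engine.

```scala
import scala.reflect.ClassTag

object EngineSketch {
  // Hypothetical stand-ins for the leafy types, so the sketch runs on its own.
  trait Data
  case class Annotation(begin: Int, end: Int, text: String) extends Data
  case class NamedEntity(text: String) extends Data
  case class Bucket(source: String, data: List[Data]) {
    def add(d: List[Data]): Bucket = Bucket(source, data ++ d)
    def select[T: ClassTag]: List[T] = data.collect { case t: T => t }
  }
  trait AnalysisEngine { def process(b: Bucket): Bucket }

  // Naive, hypothetical engine: every capitalized token becomes a NamedEntity.
  class CapitalizedWordRecognizer extends AnalysisEngine {
    def process(b: Bucket): Bucket = {
      val entities = b.select[Annotation]
        .filter(_.text.headOption.exists(_.isUpper))
        .map(a => NamedEntity(a.text))
      b.add(entities)
    }
  }

  // A bucket as a tokenizer might have left it.
  val tokenized: Bucket = Bucket("Ada wrote code", List(
    Annotation(0, 3, "Ada"), Annotation(4, 9, "wrote"), Annotation(10, 14, "code")))
  val result: Bucket = new CapitalizedWordRecognizer().process(tokenized)
  val found: List[NamedEntity] = result.select[NamedEntity]

  def main(args: Array[String]): Unit =
    println(found)
}
```

The engine never touches the tokens it read; it only appends, so later engines still see both the annotations and the entities.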
You can chain several analysis engines through a flow:
    import leafy._
    import leafy.flow.Flow

    Flow("My source text #data", AE[WhitespaceTokenizer], AE[NamedEntityRecognition])
The first parameter is the text data to process.
The remaining parameters are the engines you want to chain.
It returns a future of a bucket containing all the data set by the engines.
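To show what "a future of a bucket" means for the caller, here is a minimal sketch built on plain scala.concurrent.Future. It is not how leafy implements Flow: runFlow, Tokenizer, TokenCounter, Token and TokenCount are hypothetical names, and the stand-in Bucket is trimmed down to what the example needs.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object FlowSketch {
  // Hypothetical stand-ins for the leafy types.
  trait Data
  case class Token(text: String) extends Data
  case class TokenCount(n: Int) extends Data
  case class Bucket(source: String, data: List[Data]) {
    def add(d: List[Data]): Bucket = Bucket(source, data ++ d)
  }
  trait AnalysisEngine { def process(b: Bucket): Bucket }

  class Tokenizer extends AnalysisEngine {
    def process(b: Bucket): Bucket =
      b.add(b.source.split(" ").toList.map(Token(_)))
  }
  // Reads what the tokenizer left in the bucket.
  class TokenCounter extends AnalysisEngine {
    def process(b: Bucket): Bucket =
      b.add(List(TokenCount(b.data.count(_.isInstanceOf[Token]))))
  }

  // Hypothetical helper: run the engines in order, off the calling thread.
  def runFlow(source: String, engines: AnalysisEngine*): Future[Bucket] =
    Future(engines.foldLeft(Bucket(source, Nil))((bucket, engine) => engine.process(bucket)))

  def run(): Bucket = {
    val futureBucket = runFlow("My source text #data", new Tokenizer, new TokenCounter)
    // Blocking is fine in a demo; real code would rather map or flatMap the future.
    Await.result(futureBucket, 5.seconds)
  }

  def main(args: Array[String]): Unit =
    println(run().data.last) // TokenCount(4)
}
```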
    import leafy._
    import leafy.flow.Flow

    val startFlow = Flow("My source text #data", AE[WhitespaceTokenizer], AE[NamedEntityRecognition])
    val branch0 = Flow.branch(startFlow, AE[SomeAE], AE[AnotherAE])
    val branch1 = Flow.branch(startFlow, AE[Stuff], AE[OkWhyNot])
Each branch will run concurrently.
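One way to picture concurrent branches is two futures that both continue from the same parent future. The sketch below assumes exactly that and nothing more: branch is a hypothetical stand-in for Flow.branch, Tagger is a made-up engine that only records that it ran, and the real scheduling in leafy may be different.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object BranchSketch {
  // Hypothetical stand-ins for the leafy types.
  trait Data
  case class Tag(name: String) extends Data
  case class Bucket(source: String, data: List[Data]) {
    def add(d: Data): Bucket = Bucket(source, data :+ d)
  }
  trait AnalysisEngine { def process(b: Bucket): Bucket }

  // Hypothetical engine that just records that it ran.
  class Tagger(name: String) extends AnalysisEngine {
    def process(b: Bucket): Bucket = b.add(Tag(name))
  }

  // Hypothetical stand-in for Flow.branch: continue from the parent's result.
  def branch(parent: Future[Bucket], engines: AnalysisEngine*): Future[Bucket] =
    parent.map(start => engines.foldLeft(start)((b, e) => e.process(b)))

  def run(): (Bucket, Bucket) = {
    val start = Future(new Tagger("start").process(Bucket("My source text #data", Nil)))
    // Both branches are created before either is awaited, so they may run in parallel.
    val branch0 = branch(start, new Tagger("a"), new Tagger("b"))
    val branch1 = branch(start, new Tagger("c"))
    Await.result(branch0.zip(branch1), 5.seconds)
  }

  def main(args: Array[String]): Unit = {
    val (b0, b1) = run()
    println(b0.data)
    println(b1.data)
  }
}
```

Since buckets are immutable, the two branches can share the parent's bucket without any locking; each one ends up with its own copy plus its own results.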
    import leafy._
    import leafy.flow.Flow

    // ...
    val merged = Flow.merge(branch0, branch1, branch2)
    // You can branch it into other flows
    val continue = Flow.branch(merged, AE[...])
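Merging can be pictured as waiting for every branch and then folding the buckets together with Bucket.merge. The sketch below is an assumption about the idea, not leafy's code: the merge helper and the Tag type are hypothetical, and the helper expects at least one flow because reduce fails on an empty list.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object MergeSketch {
  // Hypothetical stand-ins for the leafy types.
  trait Data
  case class Tag(name: String) extends Data
  case class Bucket(source: String, data: List[Data]) {
    def add(d: Data): Bucket = Bucket(source, data :+ d)
    def merge(b: Bucket): Bucket = Bucket(b.source, (data ++ b.data).distinct)
  }

  // Hypothetical stand-in for Flow.merge: wait for all flows, then combine them.
  // Requires at least one flow.
  def merge(flows: Future[Bucket]*): Future[Bucket] =
    Future.sequence(flows.toList).map(buckets => buckets.reduce((a, b) => a.merge(b)))

  def run(): Bucket = {
    val start = Future.successful(Bucket("My source text #data", List(Tag("start"))))
    val branch0 = start.map(_.add(Tag("a")))
    val branch1 = start.map(_.add(Tag("b")))
    val merged = merge(branch0, branch1)
    // The merged result is a future again, so it can feed another branch.
    val continued = merged.map(_.add(Tag("after-merge")))
    Await.result(continued, 5.seconds)
  }

  def main(args: Array[String]): Unit =
    println(run().data)
}
```

Future.sequence keeps the order of its inputs, so the merged data is deterministic here, and the shared "start" tag appears only once thanks to distinct.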
// @todo: explain how this beauty works 💄😘
- Better bucket structure for large processing
- Simple and sexy resource management (avoid the ugly UIMA way)
- Maybe switch to akka-stream... not sure yet
Scala, Akka <3