A high-performance, parallelized data-transformation pipeline.
At the moment, the focus is on columnar data, represented in Scala as `Map[String, Any]` (see the illustration after the list below). Supported formats:
- CSV file with headers
- CSV file without headers
- JSON records file (one JSON value per line)
- Elasticsearch index
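
For illustration, a CSV file with the header `name,age` and a data row `Alice,42` would yield a record along these lines (this sketch assumes the input stage keeps all values as plain `String`s, which is an assumption rather than a documented guarantee):

```scala
// Hypothetical record for the row "Alice,42" under the header "name,age";
// keeping values as Strings is an assumption about the input stage.
val record: Map[String, Any] = Map("name" -> "Alice", "age" -> "42")
```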
To demonstrate the simplest use case, the following example performs a simple ETL-like transformation: it takes a CSV file as input and indexes its records into the Elasticsearch `test` index as documents of type `foo`.
```scala
import akka.actor.ActorSystem
import pipeline.in.CsvActor
import pipeline.model.EsClusterAddress
import pipeline.out.EsActor

import scala.io.Source

object TestApp extends App {
  val system = ActorSystem("test")

  // Output: index incoming records into the "test" index as documents of type "foo"
  val cluster = EsClusterAddress(
    name = "elasticsearch",
    hosts = List(("localhost", 9300))
  )
  val out = system.actorOf(EsActor.props("test", "foo", cluster), "out")

  // Input: parse the CSV file and feed each record to `out`
  val source = Source.fromFile("test.csv")
  val in = system.actorOf(CsvActor.props(source, out), "in")

  in ! CsvActor.Start

  system.awaitTermination()
}
```
Because each step of the processing runs as an Akka actor, intermediate steps can be plugged in between input and output to perform any kind of transformation (mapping, reducing, data-type manipulation, etc.), as sketched below.
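
As an illustration of such an intermediate step, here is a minimal sketch of a hypothetical `UppercaseActor` that upper-cases every `String` field of a record before forwarding it downstream. It assumes records travel between pipeline actors as plain `Map[String, Any]` messages; the library's actual message protocol may differ.

```scala
import akka.actor.{Actor, ActorRef, Props}

// Hypothetical intermediate pipeline step. Assumes records are passed
// between actors as Map[String, Any] messages (an assumption, not the
// library's documented protocol).
class UppercaseActor(next: ActorRef) extends Actor {
  def receive: Receive = {
    case record: Map[_, _] =>
      // Upper-case String values, leave everything else untouched
      val transformed = record.asInstanceOf[Map[String, Any]].map {
        case (key, value: String) => key -> value.toUpperCase
        case other                => other
      }
      next ! transformed
    // Forward any other message (e.g. control messages) unchanged
    case other => next ! other
  }
}

object UppercaseActor {
  def props(next: ActorRef): Props = Props(new UppercaseActor(next))
}
```

Wiring it in only requires pointing the input actor at the transform instead of directly at the output:

```scala
val transform = system.actorOf(UppercaseActor.props(out), "transform")
val in = system.actorOf(CsvActor.props(source, transform), "in")
```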