API Docs: spac-core | xml-spac | json-spac
Streaming Parser Combinators is a Scala library for turning streams of "parser events" into strongly-typed data, for:
- XML via javax.xml.stream, a.k.a. "StAX"
- JSON via Jackson
using handlers that are:
- Declarative - You write what you want to get, not how to get it.
- Immutable - Parsers that you create have no internal state.
- Composable - Combine and transform parsers to handle complex data structures.
- Fast - With minimal abstraction to get in the way, speed rivals any hand-written handler.
- Streaming - Parse huge XML documents from events, not a DOM.
You can jump into a full tutorial, or check out the examples, but here's a taste of how you'd write a parser for a relatively complex blog post XML structure:
val PostParser = (
  XMLParser.forMandatoryAttribute("date").map(commentDateFormat.parseLocalDate) and
  XMLSplitter(* \ "author").first[Author] and
  XMLSplitter(* \ "stats").first[Stats] and
  XMLSplitter(* \ "body").first.asText and
  XMLSplitter(* \ "comments" \ "comment").asListOf[Comment]
).as(Post)
Add the following to your build.sbt file:
libraryDependencies += "io.dylemma" %% "xml-spac" % "0.8"
libraryDependencies += "io.dylemma" %% "json-spac" % "0.8"
libraryDependencies += "io.dylemma" %% "spac-core" % "0.8"
SPaC is about handling streams of events, possibly transforming that stream, and eventually consuming it.
- ConsumableLike[-Resource, +In] is the typeclass used to represent a "Stream", showing how a Resource can be treated as an Iterator/Traversable of In events. Implementations are provided for String, InputStream, Reader, and File resources.
- Transformer[-In, +In2] is a stream processing step that converts a stream of In events to a stream of In2 events.
  - XMLTransformer[+In2] is an alias for Transformer[XMLEvent, In2]
  - JsonTransformer[+In2] is an alias for Transformer[JsonEvent, In2]
- Parser[-In, +Out] is a stream processing step that consumes a stream of In events to a single Out value.
  - XMLParser[+Out] is an alias for Parser[XMLEvent, Out]
  - JsonParser[+Out] is an alias for Parser[JsonEvent, Out]
- Splitter[In, +Context] is a building block for Transformers and Parsers. It "splits" a stream of In events into a stream of streams of In events, where each "substream" is associated with a Context value. The idea here is that if you know how to parse a certain sequence of events, you can easily extend that knowledge to parse a repetition of that sequence of events. You can also think of Splitter as a stream-based analog to an XPath.
  - XMLSplitter is available for xml-specific splitter semantics
  - JsonSplitter is available for json-specific splitter semantics
Instances of Transformer, Parser, and Splitter are immutable, meaning they can safely be reused and shared at any time, even between multiple threads.
It's common to define an implicit val fooParser: XMLParser[Foo] = /* ... */
XMLParser.forMandatoryAttribute("foo") is a parser which will find the "foo" attribute of the first element it sees.
<!-- file: elem.xml -->
<elem foo="bar" />
val xml = new File("elem.xml")
val parser: XMLParser[String] = XMLParser.forMandatoryAttribute("foo")
val result: String = parser.parse(xml)
assert(result == "bar")
Suppose you have some XML with a bunch of <elem foo="..."/> and you want the "foo" attribute from each of them.
This is a job for a Splitter. You write an XMLSplitter sort of like an XPath, to describe how to get to each element that you want to parse.
With the XML below, we want to parse the <root> element, since it represents the entire file.
We'll write our splitter by using the * matcher (representing the current element), then selecting <elem> elements that are its direct children, using * \ "elem".
<!-- file: root.xml -->
<root>
<elem foo="bar" />
<elem foo="baz" />
</root>
val xml = new File("root.xml")
val splitter: XMLSplitter[Unit] = XMLSplitter(* \ "elem")
val transformer: XMLTransformer[String] = splitter map parser
val rootParser: XMLParser[List[String]] = transformer.parseToList
val root: List[String] = rootParser.parse(xml)
assert(root == List("bar", "baz"))
Check out the docs for ContextMatcherSyntax, which defines helpers for creating the arguments to a Splitter, like the * value used above.
The underlying abstraction for processing "streams" is Handler.
Handler is allowed to be mutable, so that implementations can use utilities like Builder.
Parser and Transformer remain immutable by acting as factories for Handler.
trait Handler[-In, +Out] {
  def isFinished: Boolean
  def handleInput(input: In): Option[Out]
  def handleError(err: Throwable): Option[Out]
  def handleEnd(): Out
}
While processing a "stream", the handleInput method will be called for each In event.
The handler can indicate an early completion by returning Some(out), or indicate it is ready for more input by returning None.
At the end of the stream, handleEnd is used to force the handler to return an output.
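As a rough illustration of that contract, here is a minimal sketch of a handler that completes early once it has seen a fixed number of inputs. TakeHandler is a hypothetical name made up for this example, not a class from SPaC, and the Handler trait is copied locally so the snippet compiles on its own.

```scala
// Local copy of the Handler trait shown above, so this sketch is self-contained.
trait Handler[-In, +Out] {
  def isFinished: Boolean
  def handleInput(input: In): Option[Out]
  def handleError(err: Throwable): Option[Out]
  def handleEnd(): Out
}

// Hypothetical handler (not part of SPaC): collects inputs until `limit` have been seen.
final class TakeHandler[A](limit: Int) extends Handler[A, List[A]] {
  require(limit > 0, "limit must be positive")

  // Mutable state is acceptable here because a Handler is used for a single run.
  private val buffer = List.newBuilder[A]
  private var count = 0
  private var finished = false

  def isFinished: Boolean = finished

  def handleInput(input: A): Option[List[A]] = {
    buffer += input
    count += 1
    if (count >= limit) {
      finished = true
      Some(buffer.result()) // early completion
    } else {
      None // ready for more input
    }
  }

  // This sketch does not attempt any recovery; it just rethrows.
  def handleError(err: Throwable): Option[List[A]] = throw err

  // The stream ended before `limit` inputs arrived: return what was collected.
  def handleEnd(): List[A] = {
    finished = true
    buffer.result()
  }
}
```

Calling handleInput twice on a new TakeHandler[Int](2) returns None for the first input and Some of both inputs for the second, while calling handleEnd early returns whatever was collected so far.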
When you call parser.parse(source), the source is opened by an implicit ConsumableLike, which then feeds events from the opened source into a fresh Handler until the handler indicates an early return, or the stream reaches its end, at which point the source is closed.
trait ConsumableLike[-S, +In] {
  def getIterator(resource: S): Iterator[In] with AutoCloseable
  def apply[Out](source: S, handler: Handler[In, Out]): Out
}
The apply method asks the source (stream) to drive the handler until it produces a result Out.
There are many different ConsumableLike instances already, including generalized ones for Iterable collections and Iterators, and XML-specific ones for String, File, and InputStream. If you have a more specific "Stream" type, you can write your own ConsumableLike[StreamType, EventType].
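To make that concrete, here is a sketch of what a custom instance might look like. Countdown, CountdownConsumable, and SumHandler are hypothetical names invented for this illustration, and the two traits are local copies of the definitions above so the snippet compiles without the library. The apply loop is only one plausible way to drive a handler (it does not route errors to handleError), not necessarily what SPaC's built-in instances do.

```scala
// Local copies of the traits shown above, so this sketch is self-contained.
trait Handler[-In, +Out] {
  def isFinished: Boolean
  def handleInput(input: In): Option[Out]
  def handleError(err: Throwable): Option[Out]
  def handleEnd(): Out
}

trait ConsumableLike[-S, +In] {
  def getIterator(resource: S): Iterator[In] with AutoCloseable
  def apply[Out](source: S, handler: Handler[In, Out]): Out
}

// Hypothetical "stream" type: emits from, from - 1, ..., 1.
final case class Countdown(from: Int)

object CountdownConsumable extends ConsumableLike[Countdown, Int] {
  def getIterator(resource: Countdown): Iterator[Int] with AutoCloseable =
    new Iterator[Int] with AutoCloseable {
      private var current = resource.from
      def hasNext: Boolean = current > 0
      def next(): Int = { val n = current; current -= 1; n }
      def close(): Unit = () // nothing to release for an in-memory source
    }

  // Feed events to the handler until it returns early or the events run out,
  // then close the source. Error routing is omitted to keep the sketch short.
  def apply[Out](source: Countdown, handler: Handler[Int, Out]): Out = {
    val events = getIterator(source)
    try {
      var result: Option[Out] = None
      while (result.isEmpty && events.hasNext) {
        result = handler.handleInput(events.next())
      }
      result.getOrElse(handler.handleEnd())
    } finally events.close()
  }
}

// Hypothetical handler that never finishes early and sums every input.
final class SumHandler extends Handler[Int, Int] {
  private var total = 0
  def isFinished: Boolean = false
  def handleInput(input: Int): Option[Int] = { total += input; None }
  def handleError(err: Throwable): Option[Int] = throw err
  def handleEnd(): Int = total
}
```

With these definitions, CountdownConsumable(Countdown(4), new SumHandler) feeds 4, 3, 2, 1 into the handler and returns the sum produced by handleEnd.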
Here's how the core classes act like handler factories:
trait Parser[-In, +Out] extends (Any => Parser[In, Out]) {
  def makeHandler(): Handler[In, Out]
}
trait Transformer[-In, +In2] extends (Any => Transformer[In, In2]) {
  def makeHandler[Out](downstream: Handler[In2, Out]): Handler[In, Out]
}
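The factory idea can be sketched with simplified versions of these traits. In the example below, the function-extension part of Parser and Transformer is left out, and ToListParser, MapTransformer, and FactorySketch are hypothetical names written for this illustration rather than SPaC classes. The point is that the parser and transformer objects hold no state of their own; each run gets fresh handlers.

```scala
// Local, simplified copies of the traits above (without the function supertype).
trait Handler[-In, +Out] {
  def isFinished: Boolean
  def handleInput(input: In): Option[Out]
  def handleError(err: Throwable): Option[Out]
  def handleEnd(): Out
}
trait Parser[-In, +Out] {
  def makeHandler(): Handler[In, Out]
}
trait Transformer[-In, +In2] {
  def makeHandler[Out](downstream: Handler[In2, Out]): Handler[In, Out]
}

// Hypothetical parser: every call to makeHandler creates a new mutable buffer.
final class ToListParser[A] extends Parser[A, List[A]] {
  def makeHandler(): Handler[A, List[A]] = new Handler[A, List[A]] {
    private val buffer = List.newBuilder[A]
    def isFinished: Boolean = false
    def handleInput(input: A): Option[List[A]] = { buffer += input; None }
    def handleError(err: Throwable): Option[List[A]] = throw err
    def handleEnd(): List[A] = buffer.result()
  }
}

// Hypothetical transformer: wraps the downstream handler and maps each input through f.
final class MapTransformer[A, B](f: A => B) extends Transformer[A, B] {
  def makeHandler[Out](downstream: Handler[B, Out]): Handler[A, Out] = new Handler[A, Out] {
    def isFinished: Boolean = downstream.isFinished
    def handleInput(input: A): Option[Out] = downstream.handleInput(f(input))
    def handleError(err: Throwable): Option[Out] = downstream.handleError(err)
    def handleEnd(): Out = downstream.handleEnd()
  }
}

object FactorySketch {
  // Minimal driver loop, assumed for this sketch only.
  def run[In, Out](events: Iterator[In], handler: Handler[In, Out]): Out = {
    var result: Option[Out] = None
    while (result.isEmpty && events.hasNext) result = handler.handleInput(events.next())
    result.getOrElse(handler.handleEnd())
  }

  // Immutable and shareable: neither value holds per-run state.
  val lengths = new MapTransformer[String, Int](_.length)
  val toList = new ToListParser[Int]

  // Fresh handlers are created on every call, so runs do not interfere with each other.
  def parseLengths(words: List[String]): List[Int] =
    run(words.iterator, lengths.makeHandler(toList.makeHandler()))
}
```

Calling FactorySketch.parseLengths repeatedly reuses the same lengths and toList values, but nothing from an earlier run leaks into a later one because the mutable buffer lives only inside the handler created for that run.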