Demonstration of XML parsing using the StackOverflow data dump.
This is a simple Spark app that reads a Posts.xml input file from one of the StackExchange data dumps; the XML schema description can be found here.
The code attempts to parse one row XML element in each line; if a row is parsed, its Body, CreationDate, and ViewCount attributes are queried. For each successful parse, a compact JSON record is written onto a single line in the files of the output directory.
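
The per-line logic described above can be sketched in plain Python, without Spark, to show the idea. This is a minimal illustration and not the app's actual code: the function name row_to_json, the sample input, and the choice to keep every attribute as a string under its original name are assumptions made for the example.

```python
import json
import xml.etree.ElementTree as ET
from typing import Optional


def row_to_json(line: str) -> Optional[str]:
    """Return a compact one-line JSON record for a <row .../> line, else None.

    Hypothetical helper: the name and the output layout are assumptions.
    """
    try:
        elem = ET.fromstring(line.strip())
    except ET.ParseError:
        # Lines such as <posts> or </posts> are not complete elements on
        # their own, so they fail to parse and are skipped.
        return None
    if elem.tag != "row":
        return None
    record = {
        "Body": elem.get("Body"),
        "CreationDate": elem.get("CreationDate"),
        "ViewCount": elem.get("ViewCount"),  # kept as a string (assumption)
    }
    # Compact separators keep each record on a single line with no padding.
    return json.dumps(record, separators=(",", ":"))


# Made-up sample input in the shape of a Posts.xml file.
sample_lines = [
    "<posts>",
    '  <row Id="1" CreationDate="2010-01-01T00:00:00.000" ViewCount="42" '
    'Body="&lt;p&gt;Hi&lt;/p&gt;" />',
    "</posts>",
]

# Keep only the lines that parsed as row elements.
records = [r for r in map(row_to_json, sample_lines) if r is not None]
for r in records:
    print(r)
```

In a Spark job, a function like this would typically be applied to every line of the input text file, with the None results filtered out before the records are saved as text to the output directory.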