spark-xml-parse

Demonstration of XML parsing using the StackOverflow data dump.

Primary Language: Scala

Overview

This is a simple Spark app that reads a Posts.xml input file from one of the StackExchange data dumps; the XML schema description can be found here.

The code attempts to parse one row XML element from each input line; if a row parses, its Body, CreationDate, and ViewCount attributes are read. Each successfully parsed row is written as a compact JSON record on a single line in the files of the output directory. Lines that do not parse as a row element produce no output record.
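The per-line step described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code. It uses the JDK's built-in DOM parser so that it runs without extra dependencies. The names `RowParser`, `parseRow`, and `jsonEscape` are hypothetical. The JSON field names and the choice to emit `null` for a missing attribute are also assumptions.

```scala
import java.io.StringReader
import javax.xml.parsers.DocumentBuilderFactory
import org.xml.sax.InputSource
import scala.util.Try

// Hypothetical helper object; not taken from the repository.
object RowParser {

  // Escape characters that may not appear raw inside a JSON string.
  def jsonEscape(s: String): String = s.flatMap {
    case '"'          => "\\\""
    case '\\'         => "\\\\"
    case '\n'         => "\\n"
    case '\r'         => "\\r"
    case '\t'         => "\\t"
    case c if c < ' ' => "\\u%04x".format(c.toInt)
    case c            => c.toString
  }

  // Try to parse one input line as a single <row .../> element.
  // Returns Some(json) on success and None for anything else
  // (XML declaration, opening/closing wrapper tags, malformed lines).
  def parseRow(line: String): Option[String] = Try {
    val builder = DocumentBuilderFactory.newInstance().newDocumentBuilder()
    val root = builder
      .parse(new InputSource(new StringReader(line.trim)))
      .getDocumentElement
    require(root.getTagName == "row", "not a row element")

    // Assumption: a missing attribute is emitted as JSON null.
    def str(name: String): String =
      if (root.hasAttribute(name)) "\"" + jsonEscape(root.getAttribute(name)) + "\""
      else "null"

    // Assumption: ViewCount is emitted as a number, or null if absent or non-numeric.
    val viewCount = Try(root.getAttribute("ViewCount").trim.toLong)
      .toOption.map(_.toString).getOrElse("null")

    s"""{"body":${str("Body")},"creationDate":${str("CreationDate")},"viewCount":$viewCount}"""
  }.toOption
}
```

In a Spark job, a function like this would typically be applied with something along the lines of `sc.textFile(input).flatMap(line => RowParser.parseRow(line)).saveAsTextFile(output)`, so that unparseable lines are dropped and each parsed row becomes one output line. The JDK parser may print a diagnostic to stderr for lines it rejects.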