This project provides a MapReduce job for converting XML documents to Avro using a predefined Avro schema.
The approach works as follows:
- The CombinedXmlInputFormat ensures that multiple XML documents can be processed in one mapper.
- Mahout's XMLInputFormat splits each document into records that are delimited by the begin and end tags of the specified XmlElementName.
- In the mapper, the DatumBuilder transforms each XML record into an Avro record. During the transformation the XML is traversed recursively, and for each element the corresponding entry in the Avro schema is looked up.
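To make the recursive lookup concrete, here is a minimal, self-contained sketch of the idea in plain Java, using Avro's GenericData and the JDK DOM API. It is illustrative only and is not the project's actual DatumBuilder code; the class and method names are hypothetical.

// Hypothetical sketch of the recursive element-to-schema lookup; not the
// project's actual DatumBuilder implementation.
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XmlToAvroSketch {

    // Recursively map an XML element onto an Avro record of the given schema.
    static GenericRecord toRecord(Element element, Schema schema) {
        GenericRecord record = new GenericData.Record(schema);
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() != Node.ELEMENT_NODE) continue;
            // Look up the schema field that corresponds to this XML element.
            Schema.Field field = schema.getField(child.getNodeName());
            if (field == null) continue; // cf. xmltoavro.skip.missing.elements
            if (field.schema().getType() == Schema.Type.RECORD) {
                record.put(field.name(), toRecord((Element) child, field.schema()));
            } else {
                record.put(field.name(), child.getTextContent());
            }
        }
        return record;
    }

    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"book\",\"fields\":["
            + "{\"name\":\"title\",\"type\":\"string\"},"
            + "{\"name\":\"price\",\"type\":\"string\"}]}");
        Element root = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(
                "<book><title>Avro Guide</title><price>42</price></book>")))
            .getDocumentElement();
        System.out.println(toRecord(root, schema)); // {"title": "Avro Guide", "price": "42"}
    }
}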
The DatumBuilder requires an additional "source" attribute in the Avro schema that refers to either an XML element or an XML attribute, e.g. "source" : "element price".
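As an illustration, fields in such a schema might look as follows (the real schema is test/resources/book-catalog/book.avsc; the field names here, and the "attribute ..." form inferred by analogy with "element price", are assumptions):

"fields" : [
  { "name" : "id",    "type" : "string", "source" : "attribute id" },
  { "name" : "price", "type" : "string", "source" : "element price" }
]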
For a full example, see books.xml and the corresponding test XmlToAvroMapReduceJobTest.
The project can be built with Maven as follows:
mvn compile assembly:single
This creates a jar at target/xml-to-avro-mapreduce-jar-with-dependencies.jar that can be executed as a MapReduce job.
The job takes four arguments:
yarn jar target/xml-to-avro-mapreduce-jar-with-dependencies.jar <XmlElementName> </path/to/avro-schema.avsc> <inputPaths> <outputPath>
- XmlElementName: The name of the XML element to use for splitting. The mapper processes each such element separately.
- /path/to/avro-schema.avsc: HDFS location of the Avro Schema used for conversion.
- inputPaths: Comma-separated list of HDFS input paths.
- outputPath: HDFS output directory. MapReduce requires that the directory does not exist prior to execution.
Furthermore, there are optional Hadoop configuration properties that can be used:
- -D xmltoavro.skip.missing.elements={true|false}: Skip XML elements that are missing from the Avro schema instead of throwing an error.
- -D xmltoavro.skip.missing.attributes={true|false}: Skip XML attributes that are missing from the Avro schema instead of throwing an error.
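For example, to ignore XML elements that have no counterpart in the schema, pass the property as a generic option before the job arguments (this assumes the job is launched through Hadoop's ToolRunner, which the -D syntax implies):

yarn jar target/xml-to-avro-mapreduce-jar-with-dependencies.jar -D xmltoavro.skip.missing.elements=true book /data/book-catalog/metadata/book.avsc /data/book/raw/ /data/book/processed/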
Upload Avro schema:
hdfs dfs -mkdir -p /data/book-catalog/metadata/
hdfs dfs -copyFromLocal test/resources/book-catalog/book.avsc /data/book-catalog/metadata/
Upload sample XML:
hdfs dfs -mkdir -p /data/book/raw/
hdfs dfs -copyFromLocal test/resources/book-catalog/books.xml /data/book/raw/
Execute MR job:
yarn jar target/xml-to-avro-mapreduce-jar-with-dependencies.jar book /data/book-catalog/metadata/book.avsc /data/book/raw/ /data/book/processed/
Create Hive table:
hive -f test/resources/book-catalog/book-ddl.sql
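book-ddl.sql is not reproduced here, but a DDL of roughly the following shape would map the job output to a Hive table (a sketch assuming Hive's native Avro support; the schema URL and location follow the paths used above):

CREATE EXTERNAL TABLE book
STORED AS AVRO
LOCATION '/data/book/processed/'
TBLPROPERTIES ('avro.schema.url'='hdfs:///data/book-catalog/metadata/book.avsc');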
Execute Hive query:
hive -e "SELECT author, sum(size(review)) AS num_reviews FROM book GROUP BY author ORDER BY num_reviews DESC;"
XmlInputFormat has been copied from the Apache Mahout project (Apache License).
The classes in the ly.stealth package are based on the xml-avro GitHub project (Apache License).