This repository contains code for the XML to JSON Converter. The converter is written in Python and converts one or more XML files into JSON or JSONL files.
NOTE: This is a fork of davlee1972/xml_to_json
Converts XML to valid JSON or JSONL.
Requires only two files to get started: your XML file and the XSD schema file for that XML file.
Multiprocessing is enabled to parse XML files concurrently if the XML files are in the same format. Call with the -m # option.
Uses Python's iterparse event-based methods, which enable parsing very large files with low memory requirements. This is very similar to Java's SAX parser.
Files are processed in order with the largest files first to optimize overall parsing time.
Option to write results to either Linux or HDFS folders.
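The event-based parsing mentioned above can be illustrated with a minimal sketch. It uses only the standard library's xml.etree.ElementTree.iterparse; the helper name iter_elements and the sample data are assumptions made for illustration and are not taken from this project's code.

```python
import io
import xml.etree.ElementTree as ET


def iter_elements(source, tag):
    """Yield each <tag> element once its closing tag has been read."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem
            # Release the element's children, text and attributes after the
            # caller is done with it. The empty element itself stays attached
            # to its parent, so only a small shell remains in memory.
            elem.clear()


# Hypothetical sample input, loosely based on the PurchaseOrder example.
sample = io.BytesIO(
    b"<items><item partNum='872-AA'/><item partNum='926-AA'/></items>"
)
part_numbers = [elem.get("partNum") for elem in iter_elements(sample, "item")]
print(part_numbers)  # ['872-AA', '926-AA']
```

Because each element is handed over and cleared as soon as it is complete, memory use depends on the size of one element rather than on the size of the whole file.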
This module has outdated dependencies, so it is recommended to install it in an isolated environment. Use this script to install it in /opt/xml_to_json/ and make it available as a command on Unix systems:
# build the python module
python3 -m venv .venv
.venv/bin/pip install --upgrade pip build wheel
make build
# install the package in /opt/xml_to_json/* and create the symlink /usr/bin/xml_to_json
make install
# to uninstall the package later
make remove
xml_to_json
usage: xml_to_json [-h] -x XSD_FILE [-o OUTPUT_FORMAT] [-s SERVER]
[-t TARGET_PATH] [-z] [-p XPATH] [-a ATTRIBPATHS]
[-e EXCLUDEPATHS] [-m MULTI] [-l LOG] [-v VERBOSE] [-n]
[-d]
...
XML To JSON Parser
positional arguments:
input_files files to convert or '-' to read from stdin
optional arguments:
-h, --help show this help message and exit
-x XSD_FILE, --xsd_file XSD_FILE
xsd file name
-o OUTPUT_FORMAT, --output_format OUTPUT_FORMAT
output format json or jsonl. Default is jsonl.
-s SERVER, --server SERVER
server with hadoop client installed if hadoop not
installed
-t TARGET_PATH, --target_path TARGET_PATH
target path. hdfs targets require hadoop client
installation. Examples: /proj/test, hdfs:///proj/test,
hdfs://halfarm/proj/test
-z, --zip gzip output file
-p XPATH, --xpath XPATH
xpath to parse out.
-a ATTRIBPATHS, --attribpaths ATTRIBPATHS
extra element attributes to parse out.
-e EXCLUDEPATHS, --excludepaths EXCLUDEPATHS
elements to exclude. pass in comma separated string.
/path/exclude1,/path/exclude2
-m MULTI, --multi MULTI
number of parsers. Default is 1.
-l LOG, --log LOG log file
-v VERBOSE, --verbose VERBOSE
verbose output level. INFO, DEBUG, etc.
-n, --no_overwrite do not overwrite output file if it exists already
-d, --delete_xml delete xml file after converting to json
xml_to_json -x PurchaseOrder.xsd PurchaseOrder.xml
INFO - 2018-03-20 11:10:24 - Parsing XML Files..
INFO - 2018-03-20 11:10:24 - Processing 1 files
INFO - 2018-03-20 11:10:24 - Parsing files in the following order:
INFO - 2018-03-20 11:10:24 - ['PurchaseOrder.xml']
DEBUG - 2018-03-20 11:10:24 - Generating schema from PurchaseOrder.xsd
DEBUG - 2018-03-20 11:10:24 - Parsing PurchaseOrder.xml
DEBUG - 2018-03-20 11:10:24 - Writing to file PurchaseOrder.json
DEBUG - 2018-03-20 11:10:24 - Completed PurchaseOrder.xml
Original XML
<?xml version="1.0"?>
<purchaseOrder orderDate="1999-10-20">
<shipTo country="US">
<name>Alice Smith</name>
<street>123 Maple Street</street>
<city>Mill Valley</city>
<state>CA</state>
<zip>90952</zip>
</shipTo>
<billTo country="US">
<name>Robert Smith</name>
<street>8 Oak Avenue</street>
<city>Old Town</city>
<state>PA</state>
<zip>95819</zip>
</billTo>
<comment>Hurry, my lawn is going wild!</comment>
<items>
<item partNum="872-AA">
<productName>Lawnmower</productName>
<quantity>1</quantity>
<USPrice>148.95</USPrice>
<comment>Confirm this is electric</comment>
</item>
<item partNum="926-AA">
<productName>Baby Monitor</productName>
<quantity>1</quantity>
<USPrice>39.98</USPrice>
<shipDate>1999-05-21</shipDate>
</item>
</items>
</purchaseOrder>
JSON output (the zip values look funny, but blame Microsoft, whose XSD file spec declares zip as a decimal: <xs:element name="zip" type="xs:decimal"/>)
{
"purchaseOrderorderDate":"1999-10-20",
"shipTo":{
"shipTocountry":"US",
"name":"Alice Smith",
"street":"123 Maple Street",
"city":"Mill Valley",
"state":"CA",
"zip":90952.0
},
"billTo":{
"billTocountry":"US",
"name":"Robert Smith",
"street":"8 Oak Avenue",
"city":"Old Town",
"state":"PA",
"zip":95819.0
},
"comment":"Hurry, my lawn is going wild!",
"items":{
"item":[
{
"itempartNum":"872-AA",
"productName":"Lawnmower",
"quantity":1,
"USPrice":148.95,
"comment":"Confirm this is electric"
},
{
"itempartNum":"926-AA",
"productName":"Baby Monitor",
"quantity":1,
"USPrice":39.98,
"shipDate":"1999-05-21"
}
]
}
}
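The zip value above is consistent with a converter that turns xs:decimal text into a Python float. The following sketch shows that effect in isolation. The mapping table XSD_CONVERTERS and the helper convert_text are hypothetical names, and the types assigned to quantity and USPrice are assumptions; this is not the project's actual conversion code.

```python
import json

# Hypothetical mapping from XSD simple types to Python converters.
XSD_CONVERTERS = {
    "xs:string": str,
    "xs:decimal": float,
    "xs:positiveInteger": int,
    "xs:date": str,
}


def convert_text(text, xsd_type):
    """Convert element text using the converter for its declared XSD type."""
    return XSD_CONVERTERS.get(xsd_type, str)(text)


record = {
    "zip": convert_text("90952", "xs:decimal"),
    "quantity": convert_text("1", "xs:positiveInteger"),
    "USPrice": convert_text("148.95", "xs:decimal"),
}
print(json.dumps(record))
# {"zip": 90952.0, "quantity": 1, "USPrice": 148.95}
```

Declaring zip as xs:string in the XSD would keep the value as "90952" under this scheme.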
Also gzip the output files, parse 3 files concurrently, extract only /purchaseOrder/items/item elements, and incrementally process one XML path at a time to save memory instead of reading the entire XML file into memory.
cp PurchaseOrder.xml 1.xml
cp 1.xml 2.xml
cp 1.xml 3.xml
cp 1.xml 4.xml
xml_to_json -o jsonl -m 3 -z -p /purchaseOrder/items/item -x PurchaseOrder.xsd *.xml
INFO - 2018-03-20 16:33:50 - Parsing XML Files..
INFO - 2018-03-20 16:33:50 - Processing 5 files
INFO - 2018-03-20 16:33:50 - Parsing files in the following order:
INFO - 2018-03-20 16:33:50 - ['1.xml', '2.xml', 'PurchaseOrder.xml', '4.xml', '3.xml']
DEBUG - 2018-03-20 16:33:50 - Generating schema from PurchaseOrder.xsd
DEBUG - 2018-03-20 16:33:50 - Generating schema from PurchaseOrder.xsd
DEBUG - 2018-03-20 16:33:50 - Generating schema from PurchaseOrder.xsd
DEBUG - 2018-03-20 16:33:50 - Parsing PurchaseOrder.xml
DEBUG - 2018-03-20 16:33:50 - Writing to file PurchaseOrder.jsonl.gz
DEBUG - 2018-03-20 16:33:50 - Parsing 1.xml
DEBUG - 2018-03-20 16:33:50 - Parsing 2.xml
DEBUG - 2018-03-20 16:33:50 - Writing to file 1.jsonl.gz
DEBUG - 2018-03-20 16:33:50 - Writing to file 2.jsonl.gz
DEBUG - 2018-03-20 16:33:51 - Parsing item from 1.xml
DEBUG - 2018-03-20 16:33:51 - Parsing item from 2.xml
DEBUG - 2018-03-20 16:33:51 - Parsing item from PurchaseOrder.xml
DEBUG - 2018-03-20 16:33:51 - Completed 2.xml
DEBUG - 2018-03-20 16:33:51 - Generating schema from PurchaseOrder.xsd
DEBUG - 2018-03-20 16:33:51 - Completed PurchaseOrder.xml
DEBUG - 2018-03-20 16:33:51 - Completed 1.xml
DEBUG - 2018-03-20 16:33:51 - Generating schema from PurchaseOrder.xsd
DEBUG - 2018-03-20 16:33:51 - Parsing 4.xml
DEBUG - 2018-03-20 16:33:51 - Writing to file 4.jsonl.gz
DEBUG - 2018-03-20 16:33:51 - Parsing 3.xml
DEBUG - 2018-03-20 16:33:51 - Writing to file 3.jsonl.gz
DEBUG - 2018-03-20 16:33:51 - Parsing item from 3.xml
DEBUG - 2018-03-20 16:33:51 - Parsing item from 4.xml
DEBUG - 2018-03-20 16:33:51 - Completed 3.xml
DEBUG - 2018-03-20 16:33:51 - Completed 4.xml
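The log above shows several parsers working through a list of files. A rough sketch of the "largest files first" idea with a process pool could look like this. The function names largest_first and convert_all, and the convert_one callback, are assumptions for illustration; the project's own scheduling code may differ.

```python
import multiprocessing
import os


def largest_first(paths):
    """Return the paths sorted by file size, biggest first."""
    return sorted(paths, key=os.path.getsize, reverse=True)


def convert_all(paths, convert_one, workers=3):
    """Run convert_one over all paths in `workers` processes.

    convert_one is a hypothetical module-level function that converts a
    single file; it must be picklable so it can be sent to the workers.
    """
    with multiprocessing.Pool(processes=workers) as pool:
        # chunksize=1 hands files out one at a time in the sorted order,
        # so the biggest files are started before the small ones.
        return pool.map(convert_one, largest_first(paths), chunksize=1)
```

Starting the biggest files first reduces the chance that one large file is picked up last and keeps a single worker busy while the others sit idle.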
JSON output
ls -l *.gz
-rw-r--r-- 1 user users 191 Mar 20 16:26 1.jsonl.gz
-rw-r--r-- 1 user users 191 Mar 20 16:26 2.jsonl.gz
-rw-r--r-- 1 user users 191 Mar 20 16:26 3.jsonl.gz
-rw-r--r-- 1 user users 191 Mar 20 16:26 4.jsonl.gz
-rw-r--r-- 1 user users 203 Mar 20 16:26 PurchaseOrder.jsonl.gz
zcat *.jsonl.gz
{"itempartNum": "872-AA", "productName": "Lawnmower", "quantity": 1, "USPrice": 148.95, "comment": "Confirm this is electric"}
{"itempartNum": "926-AA", "productName": "Baby Monitor", "quantity": 1, "USPrice": 39.98, "shipDate": "1999-05-21"}
{"itempartNum": "872-AA", "productName": "Lawnmower", "quantity": 1, "USPrice": 148.95, "comment": "Confirm this is electric"}
{"itempartNum": "926-AA", "productName": "Baby Monitor", "quantity": 1, "USPrice": 39.98, "shipDate": "1999-05-21"}
{"itempartNum": "872-AA", "productName": "Lawnmower", "quantity": 1, "USPrice": 148.95, "comment": "Confirm this is electric"}
{"itempartNum": "926-AA", "productName": "Baby Monitor", "quantity": 1, "USPrice": 39.98, "shipDate": "1999-05-21"}
{"itempartNum": "872-AA", "productName": "Lawnmower", "quantity": 1, "USPrice": 148.95, "comment": "Confirm this is electric"}
{"itempartNum": "926-AA", "productName": "Baby Monitor", "quantity": 1, "USPrice": 39.98, "shipDate": "1999-05-21"}
{"itempartNum": "872-AA", "productName": "Lawnmower", "quantity": 1, "USPrice": 148.95, "comment": "Confirm this is electric"}
{"itempartNum": "926-AA", "productName": "Baby Monitor", "quantity": 1, "USPrice": 39.98, "shipDate": "1999-05-21"}
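The gzipped JSONL files can also be read from Python instead of zcat. This is a small sketch using only the standard library; the helper name read_jsonl_gz and the sample file written here are assumptions for illustration.

```python
import gzip
import json
import os
import tempfile


def read_jsonl_gz(path):
    """Yield one parsed JSON object per non-empty line of a .jsonl.gz file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


# Write a tiny sample file so the sketch runs on its own.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.jsonl.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as handle:
    handle.write('{"itempartNum": "872-AA", "quantity": 1}\n')
    handle.write('{"itempartNum": "926-AA", "quantity": 1}\n')

records = list(read_jsonl_gz(sample_path))
print([record["itempartNum"] for record in records])  # ['872-AA', '926-AA']
```

Since every line is a complete JSON document, the file can be consumed record by record without loading it fully into memory.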
Only attributes from elements found before the xpath can be included.
xml_to_json -p /purchaseOrder/items/item -a /purchaseOrder,/purchaseOrder/shipTo -x PurchaseOrder.xsd PurchaseOrder.xml
JSON output
cat PurchaseOrder.jsonl
{"purchaseOrderorderDate": "1999-10-20", "shipTocountry": "US", "itempartNum": "872-AA", "productName": "Lawnmower", "quantity": 1, "USPrice": 148.95, "comment": "Confirm this is electric"}
{"purchaseOrderorderDate": "1999-10-20", "shipTocountry": "US", "itempartNum": "926-AA", "productName": "Baby Monitor", "quantity": 1, "USPrice": 39.98, "shipDate": "1999-05-21"}
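Why only attributes from elements that appear before the xpath are available can be seen in a streaming sketch: attributes are collected when an element's start tag is read, and a row is emitted as soon as an element at the xpath is complete. The function rows_with_context is a hypothetical illustration, not the project's implementation, and it skips XSD typing, so all values stay strings.

```python
import io
import xml.etree.ElementTree as ET


def rows_with_context(source, xpath, attrib_paths):
    """Yield one dict per element at xpath, with attributes from attrib_paths."""
    path = []     # tags of the currently open elements
    context = {}  # attributes collected from attrib_paths so far
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            path.append(elem.tag)
            if "/" + "/".join(path) in attrib_paths:
                # Key is element name + attribute name, as in the output above.
                for name, value in elem.attrib.items():
                    context[elem.tag + name] = value
        else:
            if "/" + "/".join(path) == xpath:
                row = dict(context)
                row.update({elem.tag + k: v for k, v in elem.attrib.items()})
                row.update({child.tag: child.text for child in elem})
                yield row
                elem.clear()
            path.pop()


sample = io.BytesIO(
    b'<purchaseOrder orderDate="1999-10-20">'
    b'<shipTo country="US"><name>Alice Smith</name></shipTo>'
    b'<items><item partNum="872-AA"><quantity>1</quantity></item></items>'
    b'</purchaseOrder>'
)
rows = list(rows_with_context(
    sample,
    "/purchaseOrder/items/item",
    {"/purchaseOrder", "/purchaseOrder/shipTo"},
))
print(rows)
```

An element listed in attrib_paths that starts after the item elements would not have been seen yet when the rows are emitted, which matches the restriction stated above.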
The -e option removes the given xpaths from your result.
xml_to_json -e /purchaseOrder/comment,/purchaseOrder/items -x PurchaseOrder.xsd PurchaseOrder.xml
JSON output
cat PurchaseOrder.jsonl
{"purchaseOrder": {"purchaseOrderorderDate": "1999-10-20", "shipTo": {"shipTocountry": "US", "name": "Alice Smith", "street": "123 Maple Street", "city": "Mill Valley", "state": "CA", "zip": 90952.0}, "billTo": {"billTocountry": "US", "name": "Robert Smith", "street": "8 Oak Avenue", "city": "Old Town", "state": "PA", "zip": 95819.0}}}
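The effect of excluding paths can be sketched on a small in-memory tree. This is a simplification for illustration only: the helper drop_paths is a hypothetical name, and the actual converter works on a stream rather than on a fully loaded tree.

```python
import xml.etree.ElementTree as ET


def drop_paths(root, exclude_paths):
    """Remove every element whose absolute path is listed in exclude_paths."""
    def prune(elem, path):
        # Iterate over a copy, because children are removed while looping.
        for child in list(elem):
            child_path = path + "/" + child.tag
            if child_path in exclude_paths:
                elem.remove(child)
            else:
                prune(child, child_path)

    prune(root, "/" + root.tag)
    return root


root = ET.fromstring(
    "<purchaseOrder><shipTo><name>Alice Smith</name></shipTo>"
    "<comment>Hurry</comment><items><item/></items></purchaseOrder>"
)
# Same comma-separated format as the -e option shown above.
excluded = set("/purchaseOrder/comment,/purchaseOrder/items".split(","))
drop_paths(root, excluded)
print([child.tag for child in root])  # ['shipTo']
```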