/xml_to_json

Primary LanguagePythonMIT LicenseMIT

XML To JSON Converter

This repository contains code for the XML to JSON Converter. This converter is written in Python and will convert one or more XML files into JSON / JSONL files

NOTE: This is a fork of davlee1972/xml_to_json

Key Features

Converts XML to valid JSON or JSONL Requires only two files to get started. Your XML file and the XSD schema file for that XML file. Multiprocessing enabled to parse XML files concurrently if the XML files are in the same format. Call with -m # option. Uses Python's iterparse event based methods which enables parsing very large files with low memory requirements. This is very similar to Java's SAX parser Files are processed in order with the largest files first to optimize overall parsing time Option to write results to either Linux or HDFS folders

How to use this repository

How to install?

This module has dependencies which are outdated. Therefore it is recommended to install it in a isolated environment. Use this script to install it in /opt/xml_to_json/ and make it available as command on unix systems:

# build the python module
python3 -m venv .venv
.venv/bin/pip install --upgrade pip build wheel
make build

# install the package in /opt/xml_to_json/* and creates a symlink to /usr/bin/xml_to_json
make install

How to remove?

make remove

How to run?

xml_to_json

Parameters

usage: xml_to_json [-h] -x XSD_FILE [-o OUTPUT_FORMAT] [-s SERVER]
                   [-t TARGET_PATH] [-z] [-p XPATH] [-a ATTRIBPATHS]
                   [-e EXCLUDEPATHS] [-m MULTI] [-l LOG] [-v VERBOSE] [-n]
                   [-d]
                   ...

XML To JSON Parser

positional arguments:
  input_files           files to convert or '-' to read from stdin

optional arguments:
  -h, --help            show this help message and exit
  -x XSD_FILE, --xsd_file XSD_FILE
                        xsd file name
  -o OUTPUT_FORMAT, --output_format OUTPUT_FORMAT
                        output format json or jsonl. Default is jsonl.
  -s SERVER, --server SERVER
                        server with hadoop client installed if hadoop not
                        installed
  -t TARGET_PATH, --target_path TARGET_PATH
                        target path. hdfs targets require hadoop client
                        installation. Examples: /proj/test, hdfs:///proj/test,
                        hdfs://halfarm/proj/test
  -z, --zip             gzip output file
  -p XPATH, --xpath XPATH
                        xpath to parse out.
  -a ATTRIBPATHS, --attribpaths ATTRIBPATHS
                        extra element attributes to parse out.
  -e EXCLUDEPATHS, --excludepaths EXCLUDEPATHS
                        elements to exclude. pass in comma separated string.
                        /path/exclude1,/path/exclude2
  -m MULTI, --multi MULTI
                        number of parsers. Default is 1.
  -l LOG, --log LOG     log file
  -v VERBOSE, --verbose VERBOSE
                        verbose output level. INFO, DEBUG, etc.
  -n, --no_overwrite    do not overwrite output file if it exists already
  -d, --delete_xml      delete xml file after converting to json

Convert a small XML file to a JSON file

xml_to_json -x PurchaseOrder.xsd PurchaseOrder.xml

INFO - 2018-03-20 11:10:24 - Parsing XML Files..
INFO - 2018-03-20 11:10:24 - Processing 1 files
INFO - 2018-03-20 11:10:24 - Parsing files in the following order:
INFO - 2018-03-20 11:10:24 - ['PurchaseOrder.xml']
DEBUG - 2018-03-20 11:10:24 - Generating schema from PurchaseOrder.xsd
DEBUG - 2018-03-20 11:10:24 - Parsing PurchaseOrder.xml
DEBUG - 2018-03-20 11:10:24 - Writing to file PurchaseOrder.json
DEBUG - 2018-03-20 11:10:24 - Completed PurchaseOrder.xml

Original XML

<?xml version="1.0"?>
<purchaseOrder orderDate="1999-10-20">
    <shipTo country="US">
        <name>Alice Smith</name>
        <street>123 Maple Street</street>
        <city>Mill Valley</city>
        <state>CA</state>
        <zip>90952</zip>
    </shipTo>
    <billTo country="US">
        <name>Robert Smith</name>
        <street>8 Oak Avenue</street>
        <city>Old Town</city>
        <state>PA</state>
        <zip>95819</zip>
    </billTo>
    <comment>Hurry, my lawn is going wild!</comment>
    <items>
        <item partNum="872-AA">
            <productName>Lawnmower</productName>
            <quantity>1</quantity>
            <USPrice>148.95</USPrice>
            <comment>Confirm this is electric</comment>
        </item>
        <item partNum="926-AA">
            <productName>Baby Monitor</productName>
            <quantity>1</quantity>
            <USPrice>39.98</USPrice>
            <shipDate>1999-05-21</shipDate>
        </item>
    </items>
</purchaseOrder>

JSON output (zip looks funny, but blame Microsoft which says zip is a decimal in the XSD file spec <xs:element name="zip" type="xs:decimal"/>)

{   
   "purchaseOrderorderDate":"1999-10-20",
   "shipTo":{   
      "shipTocountry":"US",
      "name":"Alice Smith",
      "street":"123 Maple Street",
      "city":"Mill Valley",
      "state":"CA",
      "zip":90952.0
   },
   "billTo":{   
      "billTocountry":"US",
      "name":"Robert Smith",
      "street":"8 Oak Avenue",
      "city":"Old Town",
      "state":"PA",
      "zip":95819.0
   },
   "comment":"Hurry, my lawn is going wild!",
   "items":{   
      "item":[   
         {   
            "itempartNum":"872-AA",
            "productName":"Lawnmower",
            "quantity":1,
            "USPrice":148.95,
            "comment":"Confirm this is electric"
         },
         {   
            "itempartNum":"926-AA",
            "productName":"Baby Monitor",
            "quantity":1,
            "USPrice":39.98,
            "shipDate":"1999-05-21"
         }
      ]
   }
}

Convert an entire directory of XML files to JSONL

Also zip output files, parse 3 files concurrently, only extract /PurchaseOrder/items/item elements and incrementally process one XML path at a time to save memory instead of trying to read the entire XML file into memory.

cp PurchaseOrder.xml 1.xml
cp 1.xml 2.xml
cp 1.xml 3.xml
cp 1.xml 4.xml

xml_to_json -o jsonl -m 3 -z -p /purchaseOrder/items/item -x PurchaseOrder.xsd *.xml

INFO - 2018-03-20 16:33:50 - Parsing XML Files..
INFO - 2018-03-20 16:33:50 - Processing 5 files
INFO - 2018-03-20 16:33:50 - Parsing files in the following order:
INFO - 2018-03-20 16:33:50 - ['1.xml', '2.xml', 'PurchaseOrder.xml', '4.xml', '3.xml']
DEBUG - 2018-03-20 16:33:50 - Generating schema from PurchaseOrder.xsd
DEBUG - 2018-03-20 16:33:50 - Generating schema from PurchaseOrder.xsd
DEBUG - 2018-03-20 16:33:50 - Generating schema from PurchaseOrder.xsd
DEBUG - 2018-03-20 16:33:50 - Parsing PurchaseOrder.xml
DEBUG - 2018-03-20 16:33:50 - Writing to file PurchaseOrder.jsonl.gz
DEBUG - 2018-03-20 16:33:50 - Parsing 1.xml
DEBUG - 2018-03-20 16:33:50 - Parsing 2.xml
DEBUG - 2018-03-20 16:33:50 - Writing to file 1.jsonl.gz
DEBUG - 2018-03-20 16:33:50 - Writing to file 2.jsonl.gz
DEBUG - 2018-03-20 16:33:51 - Parsing item from 1.xml
DEBUG - 2018-03-20 16:33:51 - Parsing item from 2.xml
DEBUG - 2018-03-20 16:33:51 - Parsing item from PurchaseOrder.xml
DEBUG - 2018-03-20 16:33:51 - Completed 2.xml
DEBUG - 2018-03-20 16:33:51 - Generating schema from PurchaseOrder.xsd
DEBUG - 2018-03-20 16:33:51 - Completed PurchaseOrder.xml
DEBUG - 2018-03-20 16:33:51 - Completed 1.xml
DEBUG - 2018-03-20 16:33:51 - Generating schema from PurchaseOrder.xsd
DEBUG - 2018-03-20 16:33:51 - Parsing 4.xml
DEBUG - 2018-03-20 16:33:51 - Writing to file 4.jsonl.gz
DEBUG - 2018-03-20 16:33:51 - Parsing 3.xml
DEBUG - 2018-03-20 16:33:51 - Writing to file 3.jsonl.gz
DEBUG - 2018-03-20 16:33:51 - Parsing item from 3.xml
DEBUG - 2018-03-20 16:33:51 - Parsing item from 4.xml
DEBUG - 2018-03-20 16:33:51 - Completed 3.xml
DEBUG - 2018-03-20 16:33:51 - Completed 4.xml

JSON output

ls -l *.gz
-rw-r--r-- 1 user users 191 Mar 20 16:26 1.jsonl.gz
-rw-r--r-- 1 user users 191 Mar 20 16:26 2.jsonl.gz
-rw-r--r-- 1 user users 191 Mar 20 16:26 3.jsonl.gz
-rw-r--r-- 1 user users 191 Mar 20 16:26 4.jsonl.gz
-rw-r--r-- 1 user users 203 Mar 20 16:26 PurchaseOrder.jsonl.gz

zcat *.jsonl.gz

{"itempartNum": "872-AA", "productName": "Lawnmower", "quantity": 1, "USPrice": 148.95, "comment": "Confirm this is electric"}
{"itempartNum": "926-AA", "productName": "Baby Monitor", "quantity": 1, "USPrice": 39.98, "shipDate": "1999-05-21"}

{"itempartNum": "872-AA", "productName": "Lawnmower", "quantity": 1, "USPrice": 148.95, "comment": "Confirm this is electric"}
{"itempartNum": "926-AA", "productName": "Baby Monitor", "quantity": 1, "USPrice": 39.98, "shipDate": "1999-05-21"}

{"itempartNum": "872-AA", "productName": "Lawnmower", "quantity": 1, "USPrice": 148.95, "comment": "Confirm this is electric"}
{"itempartNum": "926-AA", "productName": "Baby Monitor", "quantity": 1, "USPrice": 39.98, "shipDate": "1999-05-21"}

{"itempartNum": "872-AA", "productName": "Lawnmower", "quantity": 1, "USPrice": 148.95, "comment": "Confirm this is electric"}
{"itempartNum": "926-AA", "productName": "Baby Monitor", "quantity": 1, "USPrice": 39.98, "shipDate": "1999-05-21"}

{"itempartNum": "872-AA", "productName": "Lawnmower", "quantity": 1, "USPrice": 148.95, "comment": "Confirm this is electric"}
{"itempartNum": "926-AA", "productName": "Baby Monitor", "quantity": 1, "USPrice": 39.98, "shipDate": "1999-05-21"}

Add additional attributes from other elements

Only attributes from elements found before the xpath can be include

xml_to_json -p /purchaseOrder/items/item -a /purchaseOrder,/purchaseOrder/shipTo -x PurchaseOrder.xsd PurchaseOrder.xml

JSON output

cat PurchaseOrder.jsonl

{"purchaseOrderorderDate": "1999-10-20", "shipTocountry": "US", "itempartNum": "872-AA", "productName": "Lawnmower", "quantity": 1, "USPrice": 148.95, "comment": "Confirm this is electric"}
{"purchaseOrderorderDate": "1999-10-20", "shipTocountry": "US", "itempartNum": "926-AA", "productName": "Baby Monitor", "quantity": 1, "USPrice": 39.98, "shipDate": "1999-05-21"}

Exclude xpath elements

This removes xpaths from your result

xml_to_json -e /purchaseOrder/comment,/purchaseOrder/items -x PurchaseOrder.xsd PurchaseOrder.xml

JSON output

cat PurchaseOrder.jsonl

{"purchaseOrder": {"purchaseOrderorderDate": "1999-10-20", "shipTo": {"shipTocountry": "US", "name": "Alice Smith", "street": "123 Maple Street", "city": "Mill Valley", "state": "CA", "zip": 90952.0}, "billTo": {"billTocountry": "US", "name": "Robert Smith", "street": "8 Oak Avenue", "city": "Old Town", "state": "PA", "zip": 95819.0}}}