A pure Python library for memory-efficient XML parsing

Primary language: Python. License: Apache License 2.0 (Apache-2.0).

regexXML

regexXML is a pure Python library for fast and memory-efficient XML parsing.

Python XML libraries such as lxml can be extremely memory-demanding. regexXML simply uses regular expressions to extract elements from XML by tag name and to parse the attributes of those elements. In the benchmarks below, this approach ran faster and used about half the memory of lxml. regexXML lets you parse gigabytes (or even terabytes) of XML files without a high-end computer.

Installation

regexXML runs under Python 2 and 3.

You can install it using pip.

pip install git+https://github.com/kyungtaekLIM/regexXML.git

Usage

To use regexXML, first import the Attr and Tag classes.

from regexXML import Attr, Tag

Use Tag to build a regex (regular expression) object for each tag name you want to parse.

gene_re = Tag("gene")
name_re = Tag("name")
entry_re = Tag("entry")

If you know that the elements of interest can be nested inside elements with the same tag name, set nested=True, at the cost of speed. The innermost element will then be matched.

gene_re = Tag("gene", nested=True)
name_re = Tag("name", nested=True)
entry_re = Tag("entry", nested=True)
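To see why same-name nesting needs special handling, here is a minimal sketch using only the standard re module. Both patterns are assumptions written for illustration and are not regexXML's actual regexes; the lookahead-based pattern is just one way to get innermost-only matching.

```python
import re

xml = "<taxon>outer <taxon>inner</taxon> tail</taxon>"

# Plain non-greedy pattern: pairs the first opening tag with the first
# closing tag, so the match is cut off in the middle of the outer element.
simple = re.compile(r"<taxon\b[^>]*>(?P<inner>.*?)</taxon>", re.DOTALL)

# Innermost-only pattern (an assumption, not regexXML's real regex): the
# negative lookahead refuses to step over another opening <taxon> tag.
# The lookahead runs at every character, which is the extra cost.
innermost = re.compile(
    r"<taxon\b[^>]*>(?P<inner>(?:(?!<taxon\b).)*?)</taxon>", re.DOTALL
)

simple_inner = simple.search(xml).group("inner")
innermost_inner = innermost.search(xml).group("inner")

print(repr(simple_inner))     # 'outer <taxon>inner'
print(repr(innermost_inner))  # 'inner'
```

The first pattern returns a fragment that is not well-formed, while the second returns only the innermost element, which matches the behaviour described for nested=True.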

To get a single tag match from an XML string, use search, which returns a match object. Calling group() on the match object gives you the whole matched string. The attribute string and the inner XML can be extracted with group("attr") and group("inner"), respectively.

gene = gene_re.search(xml_string)
tag_string = gene.group()
attribute_string = gene.group("attr")
inner_xml_string = gene.group("inner")
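The "attr" and "inner" groups behave like named groups in a regular expression. The sketch below reproduces the same calls with the standard re module; GENE_PATTERN is a hypothetical pattern written for this illustration, and regexXML's own pattern may differ (for example, this one does not handle self-closing tags).

```python
import re

# Hypothetical pattern, for illustration only. DOTALL lets "." span line
# breaks inside the element.
GENE_PATTERN = re.compile(
    r"<gene(?:\s+(?P<attr>[^>]*?))?\s*>(?P<inner>.*?)</gene\s*>",
    re.DOTALL,
)

xml_string = '<entry><gene type="demo"><name>fmt</name></gene></entry>'

gene = GENE_PATTERN.search(xml_string)
tag_string = gene.group()                # the whole <gene>...</gene> element
attribute_string = gene.group("attr")    # text between the tag name and ">"
inner_xml_string = gene.group("inner")   # everything between the two tags

# When the opening tag carries no attributes, the optional group does not
# take part in the match, so group("attr") is None.
no_attr = GENE_PATTERN.search("<gene>x</gene>").group("attr")
```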

To iterate over all matches, use finditer. Attr parses the attribute string and returns an OrderedDict object.

for name in name_re.finditer(gene.group("inner")):
    tag_string = name.group()
    attribute_string = name.group("attr")
    inner_xml_string = name.group("inner")
    
    attribute_dict = Attr(attribute_string)
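For a rough idea of what turning an attribute string into an OrderedDict involves, here is a minimal sketch. parse_attrs and ATTR_PATTERN are hypothetical stand-ins written for this illustration, not the implementation of Attr; treating None as "no attributes" is likewise an assumption of the sketch.

```python
import re
from collections import OrderedDict

# Hypothetical pattern: name, "=", then a double- or single-quoted value.
ATTR_PATTERN = re.compile(r"""([^\s=]+)\s*=\s*(?:"([^"]*)"|'([^']*)')""")

def parse_attrs(attribute_string):
    """Return the attributes as an OrderedDict, in document order."""
    attrs = OrderedDict()
    # A missing attribute string (None) yields an empty mapping.
    for m in ATTR_PATTERN.finditer(attribute_string or ""):
        key = m.group(1)
        # Exactly one of the two value groups took part in the match.
        value = m.group(2) if m.group(2) is not None else m.group(3)
        attrs[key] = value
    return attrs

attribute_dict = parse_attrs('evidence="1" type="primary"')
for k, v in attribute_dict.items():
    print("key : %s\nvalue : %s\n" % (k, v))
```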

If your XML file is huge, reading the whole file at once is a bad idea. To avoid running out of memory, use finditer_from_file, which reads the file in chunks and parses tags iteratively.

with open(filename, "r") as f:
    for entry in entry_re.finditer_from_file(f):
        tag_string = entry.group()
        attribute_string = entry.group("attr")
        inner_xml_string = entry.group("inner")

        attribute_dict = Attr(attribute_string)
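The idea of reading a file in chunks while still matching whole elements can be sketched with the standard library alone. finditer_chunks, its chunk_size parameter, and entry_pattern are hypothetical names used only for this illustration; this is one possible approach, not necessarily how finditer_from_file works internally.

```python
import io
import re

def finditer_chunks(pattern, fileobj, chunk_size=1 << 20):
    """Yield matches of a compiled regex from a file, one chunk at a time.

    Hypothetical helper for illustration; not regexXML's implementation.
    """
    buffer = ""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        last_end = 0
        for match in pattern.finditer(buffer):
            last_end = match.end()
            yield match
        # Drop what has been consumed; keep the tail, which may hold an
        # element whose closing tag has not been read yet.
        buffer = buffer[last_end:]

entry_pattern = re.compile(r"<entry\b[^>]*>(?P<inner>.*?)</entry>", re.DOTALL)

# A small in-memory "file" and a tiny chunk size, so that elements are
# split across chunk boundaries.
fake_file = io.StringIO("<entry>a</entry><entry>b</entry><entry>c</entry>")
inners = [
    m.group("inner")
    for m in finditer_chunks(entry_pattern, fake_file, chunk_size=10)
]
print(inners)  # ['a', 'b', 'c']
```

In this sketch, memory use is bounded by the chunk size plus the largest single element, as long as matches keep arriving; a long stretch of input with no matching element would make the buffer grow.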

Examples

1) Parsing an XML string

Here is an example of parsing an XML document from the UniProt database (http://uniprot.org).

from regexXML import Attr, Tag

uniprot_xml = """
<?xml version="1.0" ?>
<uniprot schemaLocation="http://uniprot.org/uniprot http://www.uniprot.org/support/docs/uniprot.xsd">
 <entry created="2009-03-24" dataset="Swiss-Prot" modified="2017-03-15" version="59">
  <accession>B0TGS9</accession>
  <name>FMT_HELMI</name>
  <protein>
   <recommendedName>
    <fullName evidence="1">Methionyl-tRNA formyltransferase</fullName>
    <ecNumber evidence="1">2.1.2.9</ecNumber>
   </recommendedName>
  </protein>
  <gene>
   <name evidence="1" type="primary">fmt</name>
   <name type="ordered locus">Helmi_20650</name>
   <name type="ORF">HM1_2133</name>
  </gene>
  <organism>
   <name type="scientific">Heliobacterium modesticaldum (strain ATCC 51547 / Ice1)</name>
   <dbReference id="498761" type="NCBI Taxonomy"/>
   <lineage>
    <taxon>Bacteria</taxon>
    <taxon>Firmicutes</taxon>
    <taxon>Clostridia</taxon>
    <taxon>Clostridiales</taxon>
    <taxon>Heliobacteriaceae</taxon>
    <taxon>Heliobacterium</taxon>
   </lineage>
  </organism>
 </entry>
</uniprot>
"""

# we know that tags are not nested by the same name.
gene_re = Tag("gene", nested=False)
name_re = Tag("name", nested=False)

# search "gene" tag that comes first.
gene = gene_re.search(uniprot_xml)

# get the whole XML of the gene element.
print("# print the first gene element")
print("%s\n" % gene.group())

# get inner-XML of the gene element.
print("# print inner-XML of the gene element")
print("%s\n" % gene.group("inner"))

# get the attribute string of the gene element.
# group("attr") returns None if the element has no attributes.
print("# print attribute string of the gene element")
print("%s\n" % gene.group("attr"))

# iterate over name elements in the gene element.
for name in name_re.finditer(gene.group("inner")):
    print("# print a name element")
    print("%s\n" % name.group())

    print("# print inner-XML of the name element")
    print("%s\n" % name.group("inner"))

    print("# print the attribute string of the name element")
    print("%s\n" % name.group("attr"))
    
    # parse the attribute string into OrderedDict.
    attr = Attr(name.group("attr"))

    # get key-value pairs of the attribute.
    print("# print parsed attributes")
    for k, v in attr.items():
        print("key : %s\nvalue : %s\n" % (k, v))

2) Parsing a huge XML file iteratively

This example parses a huge UniProt XML file (6 GB), downloaded from ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.xml.gz, in a memory-efficient manner.

I compared the CPU time and maximum memory usage required to run this script with those of the lxml equivalent. regexXML used about 10% less CPU time and half the peak memory.

Library     CPU Time (s)    Max Memory (GB)
regexXML    177             0.2
lxml        197             0.4

3) Parsing a gzipped XML file iteratively

If you want to parse the above XML file without decompressing it, just open the file using the gzip library.
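As a minimal sketch of what opening the file with gzip looks like, the snippet below writes a tiny gzipped file and reads it back in chunks. It assumes Python 3 and text mode ("rt"), chosen to mirror the "r" mode used in the plain-file example; the file name and chunk size are made up for the illustration.

```python
import gzip
import os
import tempfile

# Write a tiny gzipped XML file so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "demo.xml.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write("<uniprot><entry>a</entry><entry>b</entry></uniprot>")

# "rt" opens the archive in text mode; read(size) then returns
# decompressed str chunks, so no decompressed copy is written to disk.
chunks = []
with gzip.open(path, "rt", encoding="utf-8") as f:
    while True:
        chunk = f.read(16)
        if not chunk:
            break
        chunks.append(chunk)

text = "".join(chunks)
```

Because the object returned by gzip.open supports read() like a regular file, it can be handed to any code that reads a file chunk by chunk.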

I also compared CPU time and maximum memory usage with the lxml equivalent. Again, regexXML came out ahead, using about 25% less CPU time and half the peak memory.

Library     CPU Time (s)    Max Memory (GB)
regexXML    176             0.2
lxml        234             0.4