Hardware: Intel(R) Core(TM) i7-4500U CPU @ 1.80GHz
Goal: parse publicfeed.huge.xml (~1 GB) as quickly as possible. Output all Products/Product/ProductUrl strings, separated by newlines.
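To pin down what each implementation has to do, here is a hedged sketch of the task using the stdlib SAX parser. The tag names come from the Products/Product/ProductUrl path above; the sample document and handler names are illustrative, not taken from the real feed.

```python
import io
import xml.sax

class ProductUrlHandler(xml.sax.ContentHandler):
    """Collect the text of every ProductUrl element and write it to `out`."""

    def __init__(self, out):
        super().__init__()
        self.out = out
        self.in_url = False
        self.buf = []

    def startElement(self, name, attrs):
        if name == "ProductUrl":
            self.in_url = True
            self.buf = []

    def characters(self, content):
        # characters() may fire several times per element, so accumulate.
        if self.in_url:
            self.buf.append(content)

    def endElement(self, name):
        if name == "ProductUrl":
            self.out.write("".join(self.buf) + "\n")
            self.in_url = False

def extract_product_urls(stream, out):
    # SAX parses the stream incrementally, so memory stays constant.
    xml.sax.parse(stream, ProductUrlHandler(out))

# Tiny illustrative feed, not the real publicfeed.huge.xml.
sample = (b"<Products><Product><ProductUrl>http://example.com/a"
          b"</ProductUrl></Product></Products>")
out = io.StringIO()
extract_product_urls(io.BytesIO(sample), out)
print(out.getvalue(), end="")
```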
Existing implementations:
- XSLT via XALAN. OOM.
- XSLT via xsltproc. OOM.
- Custom parser (get_producturl.py) via PyPy: 26.5s.
- Custom parser (get_producturl.py) via RPython: 8.2s.
To be evaluated:
- rewrite of get_producturl.py in C
- expat (or hexpat)
- lxml in CPython
- xml.etree.ElementTree in PyPy
- xml.etree.cElementTree in CPython
- add a JIT to get_producturl.py via RPython
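For the ElementTree candidates, the usual streaming idiom is iterparse() plus clear(). A hedged sketch, assuming the same Products/Product/ProductUrl layout; the sample input is made up:

```python
import io
import xml.etree.ElementTree as ET

def iter_product_urls(stream):
    # iterparse yields each element as its end tag is seen; clearing
    # every finished Product keeps memory roughly constant even on a
    # 1 GB feed, since subtree contents are freed as we go.
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "ProductUrl":
            yield elem.text or ""
        elif elem.tag == "Product":
            elem.clear()

# Illustrative feed, not the real publicfeed.huge.xml.
sample = (b"<Products><Product><ProductUrl>http://example.com/a"
          b"</ProductUrl></Product></Products>")
print("\n".join(iter_product_urls(io.BytesIO(sample))))
```

On CPython 3 the same module is C-accelerated by default, so the separate cElementTree entry only matters on Python 2-era interpreters.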
Rules:
- Streaming (constant memory usage) is required for all implementations.
- Warm filesystem cache (i.e. run each measurement more than once).
- Single-threaded. We are interested only in serial performance.
- If you make your own parser, don't try to make it correct. Make it work.
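The streaming rule is easy to satisfy with the expat candidate by feeding the file in fixed-size chunks. A sketch under the same assumed element names; the chunk size is arbitrary:

```python
import io
import xml.parsers.expat

def expat_product_urls(stream, out, chunk_size=64 * 1024):
    """Stream ProductUrl text from `stream` to `out` in constant memory."""
    state = {"in_url": False, "buf": []}
    parser = xml.parsers.expat.ParserCreate()

    def start(name, attrs):
        if name == "ProductUrl":
            state["in_url"] = True
            state["buf"] = []

    def chars(data):
        # Character data can arrive in several pieces; accumulate.
        if state["in_url"]:
            state["buf"].append(data)

    def end(name):
        if name == "ProductUrl":
            out.write("".join(state["buf"]) + "\n")
            state["in_url"] = False

    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.EndElementHandler = end

    # Feed fixed-size chunks; only the current chunk is held in memory.
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            parser.Parse(b"", True)  # signal end of input
            break
        parser.Parse(chunk, False)

# Illustrative feed, not the real publicfeed.huge.xml.
sample = (b"<Products><Product><ProductUrl>http://example.com/a"
          b"</ProductUrl></Product></Products>")
out = io.StringIO()
expat_product_urls(io.BytesIO(sample), out)
print(out.getvalue(), end="")
```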
To translate get_producturl.py with RPython:
/path/to/pypy/rpython/bin/rpython ./get_producturl.py
See the RPython tutorial for more info.