Hardware: Intel(R) Core(TM) i7-4500U CPU @ 1.80GHz
Goal: parse publicfeed.huge.xml (~1 GB) as quickly as possible. Output all Products/Product/ProductUrl strings, separated by newlines.
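To pin down what each implementation has to do, here is a hedged sketch of the task using the stdlib SAX parser. The tag names come from the Products/Product/ProductUrl path above; the sample document and handler names are illustrative, not taken from the real feed.

```python
import io
import xml.sax

class ProductUrlHandler(xml.sax.ContentHandler):
    """Collect the text of every ProductUrl element and write it to `out`."""

    def __init__(self, out):
        super().__init__()
        self.out = out
        self.in_url = False
        self.buf = []

    def startElement(self, name, attrs):
        if name == "ProductUrl":
            self.in_url = True
            self.buf = []

    def characters(self, content):
        # characters() may fire several times per element, so accumulate.
        if self.in_url:
            self.buf.append(content)

    def endElement(self, name):
        if name == "ProductUrl":
            self.out.write("".join(self.buf) + "\n")
            self.in_url = False

def extract_product_urls(stream, out):
    # SAX parses the stream incrementally, so memory stays constant.
    xml.sax.parse(stream, ProductUrlHandler(out))

# Tiny illustrative feed, not the real publicfeed.huge.xml.
sample = (b"<Products><Product><ProductUrl>http://example.com/a"
          b"</ProductUrl></Product></Products>")
out = io.StringIO()
extract_product_urls(io.BytesIO(sample), out)
print(out.getvalue(), end="")
```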
Existing implementations:
- XSLT via XALAN. OOM.
- XSLT via xsltproc. OOM.
- Custom parser (get_producturl.py) via PyPy: 26.5s.
- Custom parser (get_producturl.py) via RPython: 8.2s.
To be evaluated:
- rewrite of get_producturl.py in C
- expat (or hexpat)
- lxml in CPython
- xml.etree.ElementTree in PyPy
- xml.etree.cElementTree in CPython
- add a JIT to get_producturl.py via RPython
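For the ElementTree candidates, the usual streaming idiom is iterparse() plus clear(). A hedged sketch, assuming the same Products/Product/ProductUrl layout; the sample input is made up:

```python
import io
import xml.etree.ElementTree as ET

def iter_product_urls(stream):
    # iterparse yields each element as its end tag is seen; clearing
    # every finished Product keeps memory roughly constant even on a
    # 1 GB feed, since subtree contents are freed as we go.
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "ProductUrl":
            yield elem.text or ""
        elif elem.tag == "Product":
            elem.clear()

# Illustrative feed, not the real publicfeed.huge.xml.
sample = (b"<Products><Product><ProductUrl>http://example.com/a"
          b"</ProductUrl></Product></Products>")
print("\n".join(iter_product_urls(io.BytesIO(sample))))
```

On CPython 3 the same module is C-accelerated by default, so the separate cElementTree entry only matters on Python 2-era interpreters.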
Rules:
- Streaming (constant memory usage) is required for all implementations.
- Warm filesystem cache (i.e. run each measurement more than once).
- Single-threaded. We are interested only in serial performance.
- If you make your own parser, don't try to make it correct. Make it work.
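The streaming rule is easy to satisfy with the expat candidate by feeding the file in fixed-size chunks. A sketch under the same assumed element names; the chunk size is arbitrary:

```python
import io
import xml.parsers.expat

def expat_product_urls(stream, out, chunk_size=64 * 1024):
    """Stream ProductUrl text from `stream` to `out` in constant memory."""
    state = {"in_url": False, "buf": []}
    parser = xml.parsers.expat.ParserCreate()

    def start(name, attrs):
        if name == "ProductUrl":
            state["in_url"] = True
            state["buf"] = []

    def chars(data):
        # Character data can arrive in several pieces; accumulate.
        if state["in_url"]:
            state["buf"].append(data)

    def end(name):
        if name == "ProductUrl":
            out.write("".join(state["buf"]) + "\n")
            state["in_url"] = False

    parser.StartElementHandler = start
    parser.CharacterDataHandler = chars
    parser.EndElementHandler = end

    # Feed fixed-size chunks; only the current chunk is held in memory.
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            parser.Parse(b"", True)  # signal end of input
            break
        parser.Parse(chunk, False)

# Illustrative feed, not the real publicfeed.huge.xml.
sample = (b"<Products><Product><ProductUrl>http://example.com/a"
          b"</ProductUrl></Product></Products>")
out = io.StringIO()
expat_product_urls(io.BytesIO(sample), out)
print(out.getvalue(), end="")
```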
To translate get_producturl.py with RPython:
/path/to/pypy/rpython/bin/rpython ./get_producturl.py
See the RPython tutorial for more info.