The game ain't in me no more. None of it.
xmlcutty is a simple tool for carving out elements from large XML files, fast. Since it works in a streaming fashion, it uses almost no memory and can process around 1G of XML per minute.
Why? Background.
Use a deb or rpm release. It's in AUR, too.
Or install with the go tool (recent Go releases use go install; go get no longer builds binaries):
$ go install github.com/miku/xmlcutty/cmd/xmlcutty@latest
$ cat fixtures/sample.xml
<a>
<b>
<c></c>
</b>
<b>
<c></c>
</b>
</a>
Options:
$ xmlcutty -h
Usage of xmlcutty:
-path string
select path (default "/")
-rename string
rename wrapper element to this name
-root string
synthetic root element
-v show version
The path looks a bit like XPath, but it is really only a simple matcher.
$ xmlcutty -path /a fixtures/sample.xml
<a>
<b>
<c></c>
</b>
<b>
<c></c>
</b>
</a>
You specify a path, e.g. /a/b, and all elements matching this path are printed:
$ xmlcutty -path /a/b fixtures/sample.xml
<b>
<c></c>
</b>
<b>
<c></c>
</b>
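To make the matching idea concrete, here is a minimal Python sketch of a streaming path matcher. It is not how xmlcutty is implemented (xmlcutty is written in Go); the names `cut`, `local_name` and `SAMPLE` are made up for this illustration. The sketch keeps a stack of open element names and emits an element whenever the stack equals the requested path.

```python
import io
import xml.etree.ElementTree as ET

# Same shape as fixtures/sample.xml, inlined so the sketch is self-contained.
SAMPLE = b"""<a>
<b>
<c></c>
</b>
<b>
<c></c>
</b>
</a>
"""

def local_name(tag):
    """Drop a '{namespace}' prefix, keeping only the local element name."""
    return tag.rsplit("}", 1)[-1]

def cut(stream, path):
    """Yield serialized elements whose location equals path, e.g. '/a/b'."""
    wanted = [part for part in path.split("/") if part]
    stack = []  # names of the elements that are currently open
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start":
            stack.append(local_name(elem.tag))
            continue
        if stack == wanted:
            elem.tail = None  # do not serialize text that follows the element
            yield ET.tostring(elem, encoding="unicode")
            elem.clear()      # release the children of an emitted element
        stack.pop()

matches = list(cut(io.BytesIO(SAMPLE), "/a/b"))
```

The comparison is plain list equality on element names, with no predicates, wildcards or axes, which is the sense in which this is a matcher rather than XPath. Note that the emptied elements stay attached to their parent here, so this sketch is only an approximation of true constant-memory streaming.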
You can end up with an XML document without a root. To make tools like xmllint happy, you can add a synthetic root element on the fly:
$ xmlcutty -root hello -path /a/b fixtures/sample.xml | xmllint --format -
<?xml version="1.0"?>
<hello>
<b>
<c></c>
</b>
<b>
<c></c>
</b>
</hello>
Rename the wrapper element, that is, the last element of the matching path:
$ xmlcutty -rename beee -path /a/b fixtures/sample.xml
<beee>
<c></c>
</beee>
<beee>
<c></c>
</beee>
All options combined, a synthetic root element and a renamed path element:
$ xmlcutty -root hi -rename ceee -path /a/b/c fixtures/sample.xml | xmllint --format -
<?xml version="1.0"?>
<hi>
<ceee/>
<ceee/>
</hi>
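One way to picture the two options is as post-processing steps applied to every matched element. The following Python sketch assumes the matched elements are already available as strings; `rename_and_wrap` is a hypothetical helper for illustration, not part of xmlcutty.

```python
import xml.etree.ElementTree as ET

def rename_and_wrap(fragments, rename=None, root=None):
    """Optionally rename each fragment's outer tag and wrap all in a root."""
    lines = []
    if root:
        lines.append(f"<{root}>")
    for fragment in fragments:
        elem = ET.fromstring(fragment)
        if rename:
            elem.tag = rename  # only the outermost (wrapper) element changes
        lines.append(ET.tostring(elem, encoding="unicode"))
    if root:
        lines.append(f"</{root}>")
    return "\n".join(lines)

# Two matched <c> elements, renamed to ceee and wrapped in hi.
result = rename_and_wrap(["<c></c>", "<c></c>"], rename="ceee", root="hi")
```

Because the root is added around the already matched elements, the result is a single well-formed document that tools expecting one root element can parse.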
It will parse XML files without a root element just fine.
$ head fixtures/oai.xml
<record>
<header>
<identifier>oai:arXiv.org:0704.0004</identifier>
<datestamp>2007-05-23</datestamp>
<setSpec>math</setSpec>
</header>
<metadata>
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"... >
<dc:title>A determinant of Stirling cycle numbers counts ...
<dc:type>text</dc:type>
<dc:identifier>http://arxiv.org/abs/0704.0004</dc:identifier>
...
This is an example XML response from a web service. We can slice out the
identifier elements. Note that any namespace (here oai_dc) is completely
ignored for the sake of simplicity:
$ cat fixtures/oai.xml | xmlcutty -root x -path /record/metadata/dc/identifier \
| xmllint --format -
<?xml version="1.0"?>
<x>
<identifier>http://arxiv.org/abs/0704.0004</identifier>
<identifier>http://arxiv.org/abs/0704.0010</identifier>
<identifier>http://arxiv.org/abs/0704.0012</identifier>
</x>
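Ignoring namespaces amounts to comparing local names only. A small Python sketch of that idea follows; the namespace URI `http://example.org/dc` and the helper `local_name` are placeholders for this illustration, and the code does not reflect xmlcutty's internals.

```python
import xml.etree.ElementTree as ET

# A tiny namespaced document; the namespace URI is a made-up placeholder.
DOC = (
    "<record><metadata>"
    '<dc:identifier xmlns:dc="http://example.org/dc">'
    "http://arxiv.org/abs/0704.0004"
    "</dc:identifier>"
    "</metadata></record>"
)

def local_name(tag):
    """Drop a '{namespace}' prefix, keeping only the local element name."""
    return tag.rsplit("}", 1)[-1]

identifier = ET.fromstring(DOC)[0][0]
qualified = identifier.tag       # namespace-qualified name as the parser sees it
plain = local_name(qualified)    # the name a namespace-blind matcher compares
```

With this kind of comparison, a path segment such as identifier matches the element regardless of which prefix or namespace it was declared with.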
We can go a bit further and extract the text of the element, which is like a
poor man's text() in XPath terms. By using a newline as the argument to rename,
we effectively get rid of the enclosing XML tag:
$ cat fixtures/oai.xml | xmlcutty -rename '\n' -path /record/metadata/dc/identifier \
| grep -v "^$"
http://arxiv.org/abs/0704.0004
http://arxiv.org/abs/0704.0010
http://arxiv.org/abs/0704.0012
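For comparison, the same text extraction can be sketched in Python with the standard library. This version loads the whole document into memory, so it only illustrates the result, not the streaming behaviour; `texts` and `DOC` are names invented for this example.

```python
import xml.etree.ElementTree as ET

DOC = """<x>
<identifier>http://arxiv.org/abs/0704.0004</identifier>
<identifier>http://arxiv.org/abs/0704.0010</identifier>
<identifier>http://arxiv.org/abs/0704.0012</identifier>
</x>"""

def texts(xml_string, name):
    """Yield the non-empty text content of every element called name."""
    for elem in ET.fromstring(xml_string).iter(name):
        text = "".join(elem.itertext()).strip()
        if text:  # skip blank results, similar in spirit to grep -v "^$"
            yield text

lines = list(texts(DOC, "identifier"))
```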
This last feature is handy for quickly extracting text from large XML files.