This libxml2 wrapper is intended as a small experiment in C++ interface design and various tricks relating to C++11, while still being useful.
It attempts to closely mimic the ElementTree interface of the lxml Python library, with all the perks that entails (e.g. uniform namespace and text handling), while making use of some modern C++ design ideas, such as automatic memory and ownership management. It is intended to work without surprises; in other words, all libxml2 management quirks should be hidden.
When built with a C++11 compiler, uniform initializer lists may be used in various places to specify attribute lists:
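    #include "element.hpp"

    using etree::Element;
    using etree::SubElement;

    // get_system_load() is assumed to be defined elsewhere and to return text.
    Element make_status_msg()
    {
        // Attributes are given as {name, value} pairs; "{uri}name" selects a namespace.
        Element msg("status", {
            {"first-attribute", "first-value"},
            {"{urn:some-namespace}attr2", "second-value"}
        });
        SubElement(msg, "system-load").text(get_system_load());
        return msg;
    }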
The etree::feed namespace includes an incomplete RSS/ATOM feed parser/generator. It is a work in progress, but can already parse some basic pieces.
Text passed to the API is assumed to be in UTF-8 format. Text returned by the API is in UTF-8 format wrapped in a std::string.
The library and underlying libxml2 implementation are thread safe only in the sense that deterministic behavior is guaranteed so long as library objects aren't shared among threads. This is due to the library internally performing unlocked reference counting using the underlying libxml2 DOM structures.
Thread-safe behavior is only possible if a single thread at a time is considered the owner of all live objects referencing a given document. For a thread to pass an object to another thread, it must first relinquish every other reference it holds on that document.
cpp-ElementTree consists of two files: element.hpp and element.cpp. The implementation file must be linked against libxml2 somehow during the build.
The etree::feed implementation relies on various POSIX-related time parsing functions that aren't found on Windows. In order to avoid a large external dependency (e.g. Boost), etree::feed is UNIX-only for the time being.
- Given the choice of repeatedly heap-allocating potentially short-lived proxies, or casting a void * for use as an integer, the latter path was chosen.
Element objects are one word, i.e. 8 bytes on 64-bit platforms, and may be copied very cheaply. Copying an element copies only this word and causes a reference count to be incremented.
Element, ElementTree and AttrMap each call ref() during construction and unref() during destruction.
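A minimal sketch of the one-word handle idea follows. It is not the library's code: Node and Handle are hypothetical stand-ins for a libxml2 node and for Element, the priv field stands in for the node's private user data pointer, and the document-level count is left out to keep the example short.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a libxml2 node; only the user data pointer is modelled.
struct Node {
    void *priv = nullptr;   // reused as an integer reference count
};

// Read the pointer-sized field as a counter.
inline std::uintptr_t refs(const Node *n) {
    return reinterpret_cast<std::uintptr_t>(n->priv);
}

inline void ref(Node *n) {
    n->priv = reinterpret_cast<void *>(refs(n) + 1);
}

inline void unref(Node *n) {
    assert(refs(n) > 0);
    n->priv = reinterpret_cast<void *>(refs(n) - 1);
}

// Hypothetical handle type: a single pointer, with the count kept on the node itself.
class Handle {
    Node *node_;

public:
    explicit Handle(Node *n) : node_(n) { ref(node_); }
    Handle(const Handle &other) : node_(other.node_) { ref(node_); }
    Handle &operator=(const Handle &other) {
        ref(other.node_);   // take the new reference first, so self-assignment is safe
        unref(node_);
        node_ = other.node_;
        return *this;
    }
    ~Handle() { unref(node_); }
};

// No separately allocated control block: the handle is exactly one word.
static_assert(sizeof(Handle) == sizeof(void *), "a handle is one word");
```

Because the count lives in the node, copying a handle touches no heap memory beyond the node it already points at.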
When called on a non-document node, ref() treats the node's private user data pointer as an integer and increments it. If it was previously 0, ref() calls ref() again on the node's document.
When called on a non-document node, unref() decrements the pointer. When it reaches 0, unref() calls unref() again on the node's document.
When unref() on a document node reaches zero, ::xmlFreeDoc() is invoked to destroy the document.
Due to this approach, it is possible to update a node's associated document (e.g. during append(), remove(), graft()) without having to update every Element value in existence, since regardless of how many exist, only one ref() was ever called on the node's document. Once an element has moved to a new document, the mutation function need only call unref() once on the old document and ref() once on the new document.
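The scheme described above can be modelled in a few lines. This is a simplified sketch under stated assumptions, not the library's implementation: Doc and Node are hypothetical stand-ins for the libxml2 structures, a freed flag takes the place of the ::xmlFreeDoc() call, and move_to() is an illustrative name for whatever the mutation functions do internally.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the libxml2 document and node structures.
struct Doc {
    void *priv = nullptr;   // user data pointer reused as a counter
    bool freed = false;     // stands in for ::xmlFreeDoc()
};

struct Node {
    void *priv = nullptr;   // user data pointer reused as a counter
    Doc *doc = nullptr;
};

inline std::uintptr_t as_count(void *p) { return reinterpret_cast<std::uintptr_t>(p); }
inline void *as_ptr(std::uintptr_t n) { return reinterpret_cast<void *>(n); }

void ref(Doc *d) { d->priv = as_ptr(as_count(d->priv) + 1); }

void unref(Doc *d) {
    assert(as_count(d->priv) > 0);
    d->priv = as_ptr(as_count(d->priv) - 1);
    if (as_count(d->priv) == 0)
        d->freed = true;    // the real code would destroy the document here
}

void ref(Node *n) {
    if (as_count(n->priv) == 0)
        ref(n->doc);        // only the first handle on a node pins its document
    n->priv = as_ptr(as_count(n->priv) + 1);
}

void unref(Node *n) {
    assert(as_count(n->priv) > 0);
    n->priv = as_ptr(as_count(n->priv) - 1);
    if (as_count(n->priv) == 0)
        unref(n->doc);      // the last handle releases the document
}

// Illustrative helper: re-home a node. However many handles exist, the node
// holds exactly one reference on its document, so one ref()/unref() pair suffices.
void move_to(Node *n, Doc *new_doc) {
    Doc *old_doc = n->doc;
    n->doc = new_doc;
    if (as_count(n->priv) > 0) {
        ref(new_doc);
        unref(old_doc);
    }
}
```

With three handles on a node, the node's counter reads 3 while its document's counter reads 1; moving the node leaves the node's counter untouched and shifts that single document reference across.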
- Remove items from Horrors section.
- Use libxml2 parser string interning.
- Preserve namespace prefixes better.
- Disable libxml2 stderr logs (seemingly requires TLS tricks).
- Fix up const usage everywhere (findall, removeall, etc.).
- Internally copy XPathContext for each thread (e.g. boost::thread_specific_ptr).
- etree::tostring() should copy up namespaces to subelements like lxml.
- Make child/attr iterators mutation-safe.
- Handle comments better.
- Rewrite & better tests for Element::graft().
- reparent() must update refcounts when the document has changed, if there are any existing Elements for deeply nested child nodes.