xml-in

your friendly XML navigator

What

XML is this new hot markup language everyone is raving about. Attributes, namespaces, schemas, security, XSL... what's not to love?

xml-in is not about parsing XML, but rather working with already parsed XML.

It takes heavily nested {:tag .. :attrs .. :content [...]} structures that Clojure XML parsers produce and helps to navigate these structures in a Clojure "get-in style" using internal and custom transducers.

  • clojure/data.xml is an example of a good and lazy Clojure/ClojureScript XML parser
  • funcool/tubax is another example of a ClojureScript XML parser
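To make the "get-in style" idea concrete, here is a minimal sketch over a hand-written node tree. It is not xml-in's implementation: `step` and `walk-tags` are hypothetical helpers that use plain sequence functions rather than transducers. It shows why `get-in` alone cannot follow tags (`:content` is a sequence of siblings, not a map) and what a single navigation step amounts to.

```clojure
;; A hand-written tree in the {:tag .. :attrs .. :content [...]} shape.
;; The whitespace strings mimic what a parser keeps between elements.
(def doc
  {:tag :universe :attrs {}
   :content ["\n  "
             {:tag :system :attrs {}
              :content ["\n    "
                        {:tag :solar :attrs {}
                         :content [{:tag :planet :attrs {:age "4.543"} :content ["Earth"]}
                                   {:tag :planet :attrs {:age "4.503"} :content ["Mars"]}]}]}]})

;; plain get-in cannot follow tags: the root map has no :universe key
(get-in doc [:universe :system])
;; nil

(defn step
  "Hypothetical helper: keeps the nodes tagged t and returns their children."
  [nodes t]
  (mapcat :content (filter #(= t (:tag %)) nodes)))

(defn walk-tags
  "Hypothetical helper: follows a vector of tags, starting at the root node."
  [root tags]
  (reduce step [root] tags))

(walk-tags doc [:universe :system :solar :planet])
;; ("Earth" "Mars")
```

Whitespace strings fall out naturally because `(:tag "some string")` is nil and never equals a tag keyword.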

Why

XML navigation in Clojure is usually done with the help of zippers. clojure/data.zip is the usual choice, and a common navigation looks like this:

(data.zip/xml1-> (clojure.zip/xml-zip parsed-xml)
                 :universe
                 :system
                 :delta-orionis
                 :δ-ori-aa1
                 :radius
                 data.zip/text)

There is a great article "XML for fun and profit" that shows how zippers are used to navigate XML DOM trees.
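For readers who have not used zippers, the sketch below walks a tiny hand-written tree with nothing but `clojure.zip`, which ships with Clojure. It does not reimplement `data.zip/xml1->`; `child-tagged` is a hypothetical helper that only illustrates that each step moves a "location" (a node plus its surrounding context) through the tree.

```clojure
(require '[clojure.zip :as zip])

(def doc
  {:tag :universe :attrs {}
   :content [{:tag :system :attrs {}
              :content [{:tag :mass   :attrs {} :content ["24"]}
                        {:tag :radius :attrs {} :content ["16.5"]}]}]})

(defn child-tagged
  "Hypothetical helper: moves from loc to its first child tagged t, or nil."
  [loc t]
  (loop [c (zip/down loc)]
    (cond
      (nil? c)                  nil
      (= t (:tag (zip/node c))) c
      :else                     (recur (zip/right c)))))

(def radius-text
  (-> (zip/xml-zip doc)       ;; location at the :universe root
      (child-tagged :system)
      (child-tagged :radius)
      zip/down                ;; the text child of :radius
      zip/node))

radius-text
;; "16.5"
```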

But we can do better: faster, cleaner, composable and "no zippers".

How much faster? Let's see:

zippers:

=> (time (dotimes [_ 250000] 
           (data.zip/xml1-> (clojure.zip/xml-zip parsed-xml)
                            :universe
                            :system
                            :delta-orionis
                            :δ-ori-aa1
                            :radius
                            data.zip/text)))
"Elapsed time: 13385.563442 msecs"

xml-in:

=> (time (dotimes [_ 250000]
     (xml/find-first parsed-xml [:universe
                                 :system
                                 :delta-orionis
                                 :δ-ori-aa1
                                 :radius])))
"Elapsed time: 765.884111 msecs"

Property based navigation

Here is the XML document that all the examples in this documentation are based on:

<?xml version="1.0" encoding="UTF-8"?>
<universe>
  <system>
    <solar>
      <planet age="4.543" inhabitable="true">Earth</planet>
      <planet age="4.503">Mars</planet>
    </solar>
    <delta-orionis>
      <constellation>Orion</constellation>
      <δ-ori-aa1>
        <mass>24</mass>
        <radius>16.5</radius>
        <luminosity>190000</luminosity>
        <surface-gravity>3.37</surface-gravity>
        <temperature>29500</temperature>
        <rotational-velocity>130</rotational-velocity>
      </δ-ori-aa1>
    </delta-orionis>
  </system>
</universe>

it lives in dev-resources/universe.xml

Since xml-in works with a parsed XML (e.g. a DOM tree), let's parse it once and call it the "universe":

=> (require '[clojure.data.xml :as dx])
=> (def universe (dx/parse-str (slurp "dev-resources/universe.xml")))
#'boot.user/universe

it gets parsed into a common nested {:tag :attrs :content} structure that looks like this:

=> (pprint universe)
{:tag :universe,
 :attrs {},
 :content
 ("\n  "
  {:tag :system,
   :attrs {},
   :content
   ("\n    "
    {:tag :solar,
     :attrs {},
     :content
     ("\n      "
      {:tag :planet,
      ;; ...
      ;; ...

One way to access child nodes in this XML document is to use "a vector of nested properties".

For example, let's check out "those two" planets in a solar system.

Bringing xml-in in:

=> (require '[xml-in.core :as xml])

and

=> (xml/find-all universe [:universe :system :solar :planet])
("Earth" "Mars")

All the planets are returned. In case we need "a" planet we can match the first one and stop searching:

=> (xml/find-first universe [:universe :system :solar :planet])
("Earth")

notice find-all vs. find-first
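One way to picture the difference is to turn every tag of the path into a transducer and compose them. The sketch below is an assumption about the general shape, not xml-in's source: `all-under`, `first-under` and `find-with` are hypothetical names, and the only difference between the two step functions is a `(take 1)`.

```clojure
(def doc
  {:tag :universe :attrs {}
   :content ["\n  "
             {:tag :system :attrs {}
              :content [{:tag :solar :attrs {}
                         :content [{:tag :planet :attrs {} :content ["Earth"]}
                                   {:tag :planet :attrs {} :content ["Mars"]}]}]}]})

(defn all-under
  "Hypothetical: children of every node tagged t."
  [t]
  (comp (filter #(= t (:tag %)))
        (mapcat :content)))

(defn first-under
  "Hypothetical: children of the first node tagged t only."
  [t]
  (comp (filter #(= t (:tag %)))
        (take 1)
        (mapcat :content)))

(defn find-with
  "Hypothetical: composes one step transducer per tag and runs it from the root."
  [step-xf root tags]
  (sequence (apply comp (map step-xf tags)) [root]))

(find-with all-under doc [:universe :system :solar :planet])
;; ("Earth" "Mars")

(find-with first-under doc [:universe :system :solar :planet])
;; ("Earth")
```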

All matching vs. The first matching

Even if only one element matches the search criteria, it is best not to look for it with find-all, since find-all pays the cost of looking at all the child nodes that are on the same level as the matched element.

Let's look at the example. From the XML above, let's find a radius of δ-ori-aa1 component of the delta-orionis star system:

=> (xml/find-all universe [:universe :system :delta-orionis :δ-ori-aa1 :radius])
("16.5")
=> (xml/find-first universe [:universe :system :delta-orionis :δ-ori-aa1 :radius])
("16.5")

Both find-all and find-first return the exact same value, but we know for a fact that the δ-ori-aa1 component has only one radius, which means it is best found with find-first rather than find-all.

Let's see the performance difference:

=> (time (dotimes [_ 250000]
           (xml/find-all universe [:universe :system :delta-orionis :δ-ori-aa1 :radius])))
"Elapsed time: 1216.927309 msecs"
=> (time (dotimes [_ 250000]
           (xml/find-first universe [:universe :system :delta-orionis :δ-ori-aa1 :radius])))
"Elapsed time: 792.958283 msecs"

Quite a difference. The secret is simple: find-first stops searching once it finds a matching element. That improves performance, especially across a large number of XML documents.
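The early stop can be made visible by counting how many sibling nodes get examined. This is a sketch with hypothetical names (`examined`, `counting-tag?`, `visits`), using the same filter-then-take shape as an assumption about how a "first match" step could work; it measures node visits, not time.

```clojure
(def examined (atom 0))

(defn counting-tag?
  "Hypothetical predicate: matches tag t and counts every node it looks at."
  [t]
  (fn [node]
    (swap! examined inc)
    (= t (:tag node))))

;; children of one component, in document order
(def siblings
  [{:tag :mass        :content ["24"]}
   {:tag :radius      :content ["16.5"]}
   {:tag :luminosity  :content ["190000"]}
   {:tag :temperature :content ["29500"]}])

(def all-matches
  (comp (filter (counting-tag? :radius)) (mapcat :content)))

(def first-match
  (comp (filter (counting-tag? :radius)) (take 1) (mapcat :content)))

(defn visits
  "Runs xf over the siblings and reports the result and the nodes examined."
  [xf]
  (reset! examined 0)
  (let [result (into [] xf siblings)]
    {:result result :examined @examined}))

(visits all-matches)
;; {:result ["16.5"], :examined 4}

(visits first-match)
;; {:result ["16.5"], :examined 2}
```

With `(take 1)` in place the reduction is cut short right after the first `:radius`, so the two siblings that follow it are never looked at.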

NOTE: find-first returns a "seq", and not just a "single" value, so it can be composed as described in Creating sub documents

Functional navigation

Navigation using functions, or rather transducers, adds custom "predicate batteries" to the process.

A few internal batteries are included in xml-in:

=> (require '[xml-in.core :as xml :refer [tag= some-tag= attr=]])

  • tag= finds child nodes under all matching tags
  • some-tag= finds child nodes under the first matching tag
  • attr= finds child nodes under all tags that have the given attribute key and value
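The three descriptions above can be read as small transducer factories. The following is a sketch of what such batteries could look like, written from the descriptions alone; `my-tag=`, `my-some-tag=` and `my-attr=` are hypothetical stand-ins and should not be taken as xml-in's actual definitions.

```clojure
(def solar-system
  {:tag :system :attrs {}
   :content [{:tag :solar :attrs {}
              :content [{:tag :planet
                         :attrs {:age "4.543" :inhabitable "true"}
                         :content ["Earth"]}
                        {:tag :planet
                         :attrs {:age "4.503"}
                         :content ["Mars"]}]}]})

(defn my-tag=
  "Stand-in: children of all nodes tagged t."
  [t]
  (comp (filter #(= t (:tag %))) (mapcat :content)))

(defn my-some-tag=
  "Stand-in: children of the first node tagged t."
  [t]
  (comp (filter #(= t (:tag %))) (take 1) (mapcat :content)))

(defn my-attr=
  "Stand-in: children of all nodes whose attribute k equals v."
  [k v]
  (comp (filter #(= v (get (:attrs %) k))) (mapcat :content)))

(def all-planets
  (sequence (comp (my-tag= :system) (my-tag= :solar) (my-tag= :planet))
            [solar-system]))
;; ("Earth" "Mars")

(def first-planet
  (sequence (comp (my-tag= :system) (my-tag= :solar) (my-some-tag= :planet))
            [solar-system]))
;; ("Earth")

(def younger-planet
  (sequence (comp (my-tag= :system) (my-tag= :solar) (my-attr= :age "4.503"))
            [solar-system]))
;; ("Mars")
```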

Let's find all inhabitable planets of the solar system to the best of our knowledge (i.e. based on the XML above):

=> (xml/find-in universe [(tag= :universe)
                          (tag= :system)
                          (some-tag= :solar)
                          (attr= :inhabitable "true")])
("Earth")

the find-in function takes a parsed XML and a sequence of transducers, composes them, and returns the sequence produced by applying the composed transducer to the XML
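Because the path is just a sequence of transducers, a custom "battery" is any transducer that filters nodes and hands their children to the next step. The sketch below assumes that shape: `find-in*` is a hypothetical stand-in for find-in, and `tagged` and `attr-where` are made-up helpers, with `attr-where` matching on a predicate over an attribute value rather than on equality.

```clojure
(def doc
  {:tag :universe :attrs {}
   :content [{:tag :system :attrs {}
              :content [{:tag :solar :attrs {}
                         :content ["\n      "
                                   {:tag :planet :attrs {:age "4.543"} :content ["Earth"]}
                                   {:tag :planet :attrs {:age "4.503"} :content ["Mars"]}]}]}]})

(defn under
  "Hypothetical: children of every node that satisfies pred."
  [pred]
  (comp (filter pred) (mapcat :content)))

(defn tagged [t]
  (under #(= t (:tag %))))

(defn attr-where
  "Hypothetical custom battery: nodes whose attribute k satisfies pred."
  [k pred]
  (under (fn [node]
           ;; strings and nodes without the attribute yield nil and are skipped
           (when-let [v (get (:attrs node) k)]
             (pred v)))))

(defn find-in*
  "Stand-in for find-in: composes the transducers and applies them to the doc.
   A single node is wrapped in a vector; a seq of nodes is used as is."
  [doc xfs]
  (sequence (apply comp xfs) (if (map? doc) [doc] doc)))

(find-in* doc [(tagged :universe)
               (tagged :system)
               (tagged :solar)
               (attr-where :age #(> (Double/parseDouble %) 4.51))])
;; ("Earth")
```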

Since find-in does not need to create transducers the way find-all and find-first do, it is a bit more performant:

=> (time (dotimes [_ 250000] (xml/find-first universe [:universe :system :solar :planet])))
"Elapsed time: 507.325005 msecs"

vs.

=> (time (dotimes [_ 250000] (xml/find-in universe [(some-tag= :universe)
                                                    (some-tag= :system)
                                                    (some-tag= :solar)
                                                    (some-tag= :planet)])))
"Elapsed time: 467.535705 msecs"

Creating sub documents

Let's say we need to get several properties out of the δ-ori-aa1 component. We can do it as:

=> (xml/find-first universe [:universe :system :delta-orionis :δ-ori-aa1 :mass])
("24")
=> (xml/find-first universe [:universe :system :delta-orionis :δ-ori-aa1 :radius])
("16.5")
=> (xml/find-first universe [:universe :system :delta-orionis :δ-ori-aa1 :surface-gravity])
("3.37")

we can of course group [:mass :radius :surface-gravity] together and map over them to call xml/find-first universe with a prefix, but it would not change the fact that we would need to "get-into" ":universe :system :delta-orionis :δ-ori-aa1" on every property lookup.

We can do better: navigate to :universe :system :delta-orionis :δ-ori-aa1 once and treat it as a document instead:

=> (def aa1 (xml/find-first universe [:universe :system :delta-orionis :δ-ori-aa1]))
#'boot.user/aa1
=> (xml/find-first aa1 [:mass])
("24")
=> (xml/find-first aa1 [:radius])
("16.5")
=> (xml/find-first aa1 [:surface-gravity])
("3.37")

to create a sub document no special syntax is needed, just search "up to" the new root element.
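The reason this composes is that a search result is itself a sequence of nodes, which can be fed straight back in as the starting point. Here is a self-contained sketch of that idea; `find-first*` and `first-under` are hypothetical stand-ins rather than xml-in's functions, and the final `zipmap` is just one way to collect several properties from the sub document.

```clojure
(def doc
  {:tag :universe :attrs {}
   :content [{:tag :system :attrs {}
              :content [{:tag :delta-orionis :attrs {}
                         :content [{:tag :δ-ori-aa1 :attrs {}
                                    :content [{:tag :mass :attrs {} :content ["24"]}
                                              {:tag :radius :attrs {} :content ["16.5"]}
                                              {:tag :surface-gravity :attrs {} :content ["3.37"]}]}]}]}]})

(defn first-under
  "Hypothetical: children of the first node tagged t."
  [t]
  (comp (filter #(= t (:tag %))) (take 1) (mapcat :content)))

(defn find-first*
  "Stand-in for find-first: accepts a single node or a seq of nodes."
  [doc tags]
  (sequence (apply comp (map first-under tags))
            (if (map? doc) [doc] doc)))

;; navigate to the component once; the result is a seq of its child nodes
(def aa1 (find-first* doc [:universe :system :delta-orionis :δ-ori-aa1]))

(def props [:mass :radius :surface-gravity])

;; each lookup now starts from the sub document, not from the root
(def aa1-props
  (zipmap props (map #(first (find-first* aa1 [%])) props)))

aa1-props
;; {:mass "24", :radius "16.5", :surface-gravity "3.37"}
```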

and in cases where it is applicable, using a sub document is a bit faster:

=> (time (dotimes [_ 100000]
           [(xml/find-first universe [:universe :system :delta-orionis :δ-ori-aa1 :mass])
            (xml/find-first universe [:universe :system :delta-orionis :δ-ori-aa1 :radius])
            (xml/find-first universe [:universe :system :delta-orionis :δ-ori-aa1 :surface-gravity])]))

"Elapsed time: 973.376399 msecs"

vs.

=> (time (dotimes [_ 100000]
           (let [aa1 (xml/find-first universe [:universe :system :delta-orionis :δ-ori-aa1])]
             [(xml/find-first aa1 [:mass])
              (xml/find-first aa1 [:radius])
              (xml/find-first aa1 [:surface-gravity])])))

"Elapsed time: 760.332762 msecs"

License

Copyright © 2019 tolitius

Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.