logstash-plugins/logstash-filter-xml

Handling xpath results

wiibaa opened this issue · 1 comments

Background

From the XPath 1.0 W3C Recommendation:

The primary syntactic construct in XPath is the expression. [...] An expression is evaluated to yield an object, which has one of the following four basic types:

  • node-set (an unordered collection of nodes without duplicates)
  • boolean (true or false)
  • number (a floating-point number)
  • string (a sequence of UCS characters)

From the XPath 3.1 W3C Recommendation:

Sequences

An important characteristic of the data model is that there is no distinction between an item (a node, function, or atomic value) and a singleton sequence containing that item. An item is equivalent to a singleton sequence containing that item and vice versa.
A sequence may contain any mixture of nodes, functions, and atomic values.
[...] Sequences replace node-sets from XPath 1.0. In XPath 1.0, node-sets do not contain duplicates.

Sequences were introduced in XPath 2.0.

It’s useful, or at least interesting, to establish which XPath version applies in this context.

I am using Elastic Stack with Logstash 5.2.1. On the system where Elastic Stack is installed, entering the following Unix command:

find / -name "xpath"

returns:

/opt/logstash/vendor/bundle/jruby/1.9/gems/nokogiri-1.7.0.1-java/lib/nokogiri/xml/xpath

The corresponding version-specific Nokogiri web page contains a list of features that includes:

XPath 1.0 support for document searching

Further reading appears to confirm that the xpath setting in the Logstash xml filter supports a subset of XPath 1.0.

With that in mind, and specifically this part of the XPath 1.0 definition:

node-set (an unordered collection of nodes without duplicates)

It’s interesting (to me 🙂) that the xpath setting of the Logstash xml filter returns an array, that is, an ordered collection.

My two cents

Ideally, Logstash should honor the spec (the XPath 1.0 W3C Recommendation), and return the corresponding (Ruby) data types.

I write “ideally” because, for an XPath 1.0 expression that yields a node-set, honoring the spec literally would mean changing the existing behavior of the Logstash xpath setting to return an unordered collection without duplicates instead of an array. In Ruby, the closest match to a node-set is a Set rather than a hash, because a hash maps keys to values and preserves insertion order. It’s more pragmatic to keep returning an array in this case. Looking ahead, an array is also a better fit for the sequences of XPath 2.0 and later, which are ordered collections.
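To make the suggestion concrete, here is a minimal Ruby sketch of the mapping I have in mind. It is only an illustration under assumptions: the helper name xpath_result_to_field is hypothetical and is not part of logstash-filter-xml. It also assumes the XPath evaluator hands back a Ruby boolean, number, or string for the three scalar types and some enumerable object for a node-set. Plain Ruby values stand in for real XPath results so the sketch runs without any XML library.

```ruby
# Hypothetical helper, NOT part of logstash-filter-xml.
# Maps the result of an XPath 1.0 expression to the value that would be
# stored in the event field.
def xpath_result_to_field(result)
  case result
  when true, false, Numeric, String
    # boolean, number, string: keep the scalar as it is
    result
  else
    # node-set (assumed to be enumerable): return an array of strings,
    # keeping the order in which the nodes were yielded
    result.map(&:to_s)
  end
end

# Stand-in values; a real evaluator would produce these from a document.
count_result   = xpath_result_to_field(2.0)        # e.g. a count(...) expression
boolean_result = xpath_result_to_field(true)       # e.g. a boolean(...) expression
nodeset_result = xpath_result_to_field([:a, :b])   # e.g. a path expression
```

With this shape, only node-set results become arrays, which keeps the pragmatic behavior described above, while boolean, number, and string results keep their own Ruby types instead of being wrapped.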