/group-by

an XQuery library for computing aggregates over tuples in MarkLogic Server

Primary LanguageXQueryOtherNOASSERTION

group-by

An XQuery library for computing aggregates over tuples in MarkLogic Server

requires MarkLogic Server v7 or greater

available through mlpm:

mlpm install group-by

depends on https://github.com/joemfb/cts-extensions

MarkLogic Server provides a huge API surface for all kinds of queries, but selecting tabular result-sets and computing additional columns (i.e., performing SQL-style queries in XQuery) tends to create highly-coupled and overly complicated code.

simple example: select the average salary and median bonus for each position at each company in a set of personnel records of the following structure:

<record>
  <name>John Doe</name>
  <position>rock star ninja</position>
  <company>awesome startup</company>
  <salary>50000</salary>
  <bonus>5000</bonus>
</record>
for $tuple in cts:value-tuples((
  cts:element-reference(xs:QName("position")),
  cts:element-reference(xs:QName("company"))
))
let $values := json:array-values($tuple)
let $query := cts:and-query((
  cts:element-range-query(xs:QName("position"), "=", $values[1]),
  cts:element-range-query(xs:QName("company"), "=", $values[2])
))
return (
  json:array-push($tuple,
    cts:avg-aggregate(cts:element-reference(xs:QName("salary")), (), $query)),
  json:array-push($tuple,
    cts:median(cts:values(cts:element-reference(xs:QName("bonus")), $query))),
  $tuple
)

Given the sample document (and the appropriate index configuration), that query would generate this result:

[ 'rock star ninja', 'awesome startup', '50000', '5000' ]

While this example query is readable enough, the complexity of this approach increases dramatically as query requirements proliferate.

In contrast, the group-by library drastically simplifies these types of operations, providing a declarative syntax for aggregation strategies and computations.

cts:group-by((
  cts:column(cts:element-reference(xs:QName("position"))),
  cts:column(cts:element-reference(xs:QName("company"))),
  cts:compute("avg", cts:element-reference(xs:QName("salary"))),
  cts:compute("median", cts:element-reference(xs:QName("bonus")))
))
implementation note:
Both of these queries depend on the cts:value-tuples function, which returns each unique tuple from a set of lexicons: an "*n*-way co-occurrence" query. Each tuple represents a set of lexicon values that occur together in one-or-more fragments (usually documents). Documents containing multiple values from the same lexicon can lead to unexpected results, as there will be more than one tuple for such a document. Field and path range indices can be used to mitigate this effect, as they enable more granular element or attribute selection.

In the spirit of browser polyfills, this module is placed in the http://marklogic.com/cts namespace. It provides three query strategies: cts:group-by, cts:cross-product, and cts:cube (warning, experimental). Each accepts (up to) three parameters:

parameter purpose
$type-defs query specification composed of cts:row, cts:column, and cts:compute functions
$options optional sequence of option strings formatted as "_option-name_=_option-value_".
$query optional cts:query object constraining the results
options purpose
cts:value-tuples options controls how tuples are generated from cts:row and cts:column specifications
headers boolean, include column headers in results array
format array or map, return results as a json:array or a map:map object
cts:group-by(
  $type-defs as function(*)*,
  [$options as xs:string*],
  [$query as cts:query?]
) as item()*
notes
  • cts:row and cts:column are treated identically by cts:group-by
  • regardless of the order in the specification, all rows are processed before all columns
  • each computation is constrained to the values of the rows and columns in that array
  • cts:group-by is processed recursively, allowing for complex and arbitrarily-nested result-sets (see advanced queries).
example output
[
  'row-1_value-1',
  'row-2_value-1',
  'column-1_value-1',
  'compute-1'
]
cts:cross-product(
  $type-defs as function(*)*,
  [$options as xs:string*],
  [$query as cts:query?]
) as item()*
notes
  • cts:cross-product calculates columns over rows
  • regardless of the order in the specification, all rows are processed before all columns; both rows and columns are required
  • each column is constrained to the row values preceding it
  • each computation is constrained in the same manner as the columns, and additionally to the values of the columns in a given array
example output
[
  'row-1_value-1',
  'row-2_value-1',
  [ 
    [ 'column-1_value-1', 'compute-1' ],
    [ 'column-1_value-2', 'compute-1' ],
    [ ... ]
  ]
]
cts:cube(
  $type-defs as function(*)*,
  [$options as xs:string*],
  [$query as cts:query?]
) as item()*
notes

Warning: cts:cube currently is in an experimental state, and can be very computationally expensive. It's output, while logically nested, is arcane and difficult to use. In the future, it will denormalize the results into a useful tabular structure.

  • cts:cube is intended to allow for fine grained OLAP-style cube processing.
  • regardless of the order in the specification, all rows are processed before all columns; both rows and columns are required
  • each column is constrained to the row values preceding it, and the row values in ancestor arrays
  • each computation is constrained in the same manner as the columns, and additionally to the values of the columns in a given array
example output
[
  [
    'row-1_value-1',
    'compute-1',
    [
      [ 'column-1_value-1', 'compute-1' ],
      [ 'column-1_value-2', 'compute-1' ],
      [ ... ]
    ],
    [
      'row-2_value-1',
      'compute-1',
      [
        [ 'column-1_value-1', 'compute-1' ],
        [ 'column-1_value-2', 'compute-1' ],
        [ ... ]
      ]
    ]
  ]
]

query specification

Query specifications are created from three function types: cts:column, cts:row, and cts:compute. Each type constructs a named cts:reference to a lexicon.

cts:column(
  $reference as cts:reference
) as (function() as element(cts:column))

cts:column(
  $alias as xs:string,
  $reference as cts:reference
) as (function() as element(cts:column))
cts:row(
  $reference as cts:reference
) as (function() as element(cts:row))
cts:row(
  $alias as xs:string,
  $reference as cts:reference
) as (function() as element(cts:row))
notes
  • the names "rows" and "columns" are understood in the spirit of spreadsheet pivot tables
  • for cts:column#1 and cts:row#1, a default alias is created from the properties of the cts:reference (such as the element name, or path expression)
  • they are treated identically by cts:group-by; the distinction between them determines the nesting of result-sets in cts:cross-product and cts:cube

In a addition to a reference1, cts:compute requires the name of an aggregate function, which is applied to the referenced lexicon.

[1] Note: three built-in aggregates require a sequence of two lexicons

cts:compute(
  $function as xs:string,
  $reference as cts:reference*,
) as (function() as element(cts:compute))

or

cts:compute(
  $alias as xs:string,
  $function as xs:string,
  $reference as cts:reference*,
  [$options as xs:string*]
) as (function() as element(cts:compute))

parameters

$alias

for cts:compute#2, a default alias is constructed: the function name is concatenated with the alias created from the cts:reference properties; ex: avg-<element-name>

$function

$function must be either the name of one of the built-in aggregate functions (as named in the Search, REST and Java APIs), or a user-defined aggregate function, formatted as native/_plugin-name_/_function-name_.

Note: three of the built-in aggregate functions require two references

built-in aggregate functions $reference cardinality
avg 1
correlation 2
count 1
covariance 2
covariance-population 2
max 1
median 1
min 1
stddev 1
stddev-population 1
sum 1
variance 1
variance-population 1
$reference

All of the built-in aggregate functions require the $reference parameter to be in reference to a lexicon of numeric type, except for count, which accepts a $reference of any type.

  • int
  • unsignedInt
  • long
  • unsignedLong
  • float
  • double
  • decimal
$options

The $options parameter is passed to cts:aggregate, except in the case of the median function, when it is passed to cts:values.

FIXME

additional modules

group-by-json (http://marklogic.com/cts/group-by/json)

Warning: this implementation is unstable and likely to change.

grpj:query($query)

evaluate map / JSON serialized cts:group-by query definitions

params
  • $query as map:map or json:object

REST extension: http://marklogic.com/rest-api/resource/group-by

MarkLogic REST API extension for JSON cts:group-by queries over any database

POST /v1/resources/group-by

evaluates a JSON-serialized cts:group-by query

params
  • rs:database as xs:string?: optionally specify the name of a database to query

License Information

group-by
  • Copyright (c) 2014 Joseph Bryan. All Rights Reserved.
  • Copyright (c) 2014 Gary Vidal. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

[http://www.apache.org/licenses/LICENSE-2.0] (http://www.apache.org/licenses/LICENSE-2.0)

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

The use of the Apache License does not indicate that this project is affiliated with the Apache Software Foundation.