statisticsnorway/java-vtl

Proposal for a more powerful aggregation syntax

hadrienk opened this issue · 1 comments

The syntax described by the version 1.1 of the VTL specification lacks functionalities essential to a practical usage of the language.
The aggregate form is not very consistent with the join statement and does not play well with the concept of "expression lifting" the user manual describes. Additionally, the impossibility to use more than one aggregation function per expression dictates the use of workaround such as:

dsA := sum(ds1.a) group by x
dsB := sum(ds1.b) group by x
dsC := sum(ds1.c) group by x

ds2 := [dsA, dsB, dsC] {
   a := dsA.a,
   b := dsB.b,
   c := dsC.c
}

The following proposals aim to solve those problems by extending the VTL Syntax. The need to be able to aggregate using several functions on several components could be solved by reusing the join with body construct:

ds1 := [ds2 group by comp1, comp2] {
  /* aggregate body */
}
ds1 := [ds2 group along comp1, comp2] {
  /* aggregate body */
}
ds1 := [ds2 group with hierarchy ds3 on comp1, comp2] {
  /* aggregate body */
}

The consistent use of the group keyword makes it easier to recognize that the expression is an aggregation whereas the prefixes by comp1, comp2 and along comp1, comp2 expresses clearly the intent and provide an easy way to select the components.

The with hierarchy ds3 on comp1, comp2 makes it possible to use another dataset as a graph representing the groups we want to aggregate upon. This construct might be hard to implement if the graph includes signs as it is proposed by the VTL 1.1 specification but it makes it easier to apprehend for the language users since it can be related to other forms of aggregations.

Another syntax could be with hierarchy(ds3) on comp1, comp2. This leaves the possibility to use statically defined aggregation rule/hierarchies and makes it clear that we are transforming a dataset so that it becomes an aggregation rule/hierarchies.

Using an aggregate form of the join syntax would then mean that variables accessible within the scope of the aggregation body are different. One possibility is that the components become lists instead of a scalars. Provided that we implement some special list/set functions, the aggregation function would become regular VTL functions:

/* ds2 contains the components id, measure1 and measure2 */
ds1 := [ds2 group by id] {
  first_value := measure1[0],
  sum_value := sum(measure2)
  average_value := sum(measure2)/count(measure2)
  /* etc. */
}

Another approach is to make all aggregated components a dataset with one measure, allowing deeper constructs. It has the benefit of building upon the normal aggregation functions that are defined by the specification and the concept of lifting :

/* ds2 contains the components id and measure1, measure2 */
ds1 := [ds2 group by id] {
  value := [measure1, measure2] {
    value := measure1 + measure2
  }
  aggregated := sum(value)
  aggregated_with_lifting := sum(measure1 + measure2)
}

Regular join and aggregation could be used together:

ds1 := get("...")
ds2 := get("...")

ds_agg := [ds1, ds2 on comp1, comp2 group by comp2] {
  /* aggregate */
}

The other clauses would work the same way:

ds1 := get("...")
ds2 := get("...")

ds_agg := [ds1, ds2 on id1, id2 group by id2] {
  /* aggregate */
  filter count(measure2) > 0
  sum_value := sum(measure1) + sum(measure2)
  average_value := sum(measure2)/count(measure2)
  unfold id1, average_value to "id1", "id2",
  rename id1 to avg_id1, id2 to avg_id2, 
  unfold id1, average_value to "id1", "id2",
  rename id1 to sum_id1, id2 to sum_id2, 
  keep average_value, sum_id2
}

Notes for further reflexion. Page 81 of the General description - the example given use a block to transform a liftable expression:

D1.Total + size(D2) 

/* equivalent to */

{
  V := size(D2)
  D1.Total + V /* liftable */
}

/* lifting */

{
  V := size(D2)
  [D1] {
    filter D1.Total is not null,
    Total := D1.Total + V /* should be $V ? */
  }
}

This raise the following questions:

  1. should {} constructs be treated as a simple statement?
  2. could then [datasetExpr] { clause, clause* } expressed as [datasetExpre] statement ?
  3. does it imply that datasetExpr, datasetExpr is the join operator?
  4. couldn't the following sum(ds1.x), avg(ds1.y) group by z be lifted the same way?