vega/vega-dataflow

Extent

domoritz opened this issue · 4 comments

It would be great to have an aggregation function that computes the difference between the max and the min value.

Is extent a good name? Other alternatives are spread or range.

jheer commented

Why not simply generate max and min and then compute the difference? I'm sure an extent function could be slightly more convenient, but I'm not sure that justifies the additional surface area.

domoritz commented

Makes sense. However, sorting marks by such an aggregate is quite inconvenient in Vega-Lite.

For example, this is what I would like to write:

{
  "$schema": "https://vega.github.io/schema/vega-lite/v2.json",
  "data": {"url": "data/barley.json"},
  "mark": "bar",
  "encoding": {
    "x": {
      "aggregate": "sum",
      "field": "yield",
      "type": "quantitative"
    },
    "y": {
      "field": "variety",
      "type": "nominal",
      "sort": {"op": "extent","field": "yield"}
    }
  }
}

The only way to make this work today is to compute the aggregate outside the encoding, in a transform:

{
  "$schema": "https://vega.github.io/schema/vega-lite/v2.json",
  "data": {"url": "data/barley.json"},
  "mark": "bar",
  "transform": [
    {
      "summarize": [
        {"aggregate": "min", "field": "yield", "as": "min_yield"},
        {"aggregate": "max", "field": "yield", "as": "max_yield"}
      ],
      "groupby": ["variety"]
    },
    {"calculate": "datum.max_yield - datum.min_yield", "as": "extent"}
  ],
  "encoding": {
    "x": {
      "aggregate": "sum",
      "field": "yield",
      "type": "quantitative"
    },
    "y": {
      "field": "variety",
      "type": "nominal",
      "sort": {"op": "min","field": "extent"}
    }
  }
}

Even if we don't add "extent", this is an interesting example.

jheer commented

Thanks @domoritz, now I understand the motivation a bit better. However, I don't think this strategy scales. We could add an extent (or span) operation for max - min. Next, someone will (quite reasonably) also want the IQR span (q3 - q1), so we would add that too, and so on. As a result, I don't think adding more Vega-level aggregates is a good strategy.

If you'd like to extend Vega-Lite to include additional aggregate ops that then compile to an aggregate + formula at the Vega level, that could be one option. A more attractive option might be to allow aggregate formulas (e.g., max - min) in addition to aggregate functions.
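
To make the aggregate-formula idea concrete, here is a minimal Python sketch. It is not Vega or Vega-Lite code, and the names `BASE_OPS`, `DERIVED_OPS`, and `aggregate` are hypothetical, invented for this illustration. It assumes a small fixed set of base aggregates and expresses derived ops such as span and IQR as formulas over them, so adding a derived op does not grow the base set. The rows are made up and do not come from barley.json.

```python
from collections import defaultdict
from statistics import quantiles

# Base aggregates: the small, fixed set an engine implements natively.
BASE_OPS = {
    "min": min,
    "max": max,
    # Quartiles via the standard library; needs at least two values per group.
    "q1": lambda xs: quantiles(xs, n=4, method="inclusive")[0],
    "q3": lambda xs: quantiles(xs, n=4, method="inclusive")[2],
}

# Derived ops (hypothetical names): the base aggregates each one needs,
# plus a formula that combines them.
DERIVED_OPS = {
    "span": (("max", "min"), lambda hi, lo: hi - lo),
    "iqr": (("q3", "q1"), lambda q3, q1: q3 - q1),
}


def aggregate(rows, groupby, field, op):
    """Group rows by `groupby` and apply a base or derived op to `field`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[groupby]].append(row[field])

    result = {}
    for key, values in groups.items():
        if op in BASE_OPS:
            result[key] = BASE_OPS[op](values)
        else:
            needs, formula = DERIVED_OPS[op]
            result[key] = formula(*(BASE_OPS[name](values) for name in needs))
    return result


# Made-up rows for illustration; these are not values from barley.json.
rows = [
    {"variety": "A", "yield": 10},
    {"variety": "A", "yield": 30},
    {"variety": "A", "yield": 20},
    {"variety": "B", "yield": 5},
    {"variety": "B", "yield": 6},
]

spans = aggregate(rows, "variety", "yield", "span")
# Sort varieties by span, as the "sort" in the example specs intends.
order = sorted(spans, key=spans.get)
```

With this layout, supporting another derived op means adding one entry to `DERIVED_OPS`; the base set stays unchanged.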

Also, FWIW I think of extent as referring to [min, max] (a 2-tuple) and span as referring to the magnitude of the extent (max - min).
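
The terminology can be pinned down with a short Python sketch. The function names `extent` and `span` are chosen here only to mirror the two terms; they are not an existing Vega API.

```python
def extent(values):
    """Return the (min, max) pair of a non-empty sequence."""
    values = list(values)
    return min(values), max(values)


def span(values):
    """Return the magnitude of the extent, i.e. max - min."""
    lo, hi = extent(values)
    return hi - lo
```

Under these definitions, `extent([3, 1, 4])` is the pair `(1, 4)` and `span([3, 1, 4])` is the single number `3`.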

domoritz commented

I like the idea of aggregate formulas, but we can address that in Vega 3.1 and Vega-Lite 2.1.

The distinction between extent and span makes sense. I will adopt those terms.