tbanel/orgaggregate

Suggestion: separate group by parameters from the projection of the grouping operations

dmgerman opened this issue · 4 comments

This module is great. I have a suggestion that might make it better.

Aggregation is an operation with two parameters: the columns to aggregate by, and the operations on each of the subgroups. I think the syntax of the blocks would be cleaner if those two parameters are separated:

  • one parameter for the aggregation column (or columns, perhaps a future option)
  • one parameter for the actual projected values for each of the subgroups (which might include the aggregation column, or not, or perhaps not in the first location.

Just a thought. This is how SQL implements the operation: subgroup by group by attributes, and then project based on the SELECT statement the results for each subgroup)

Thanks for your interest!

I feel that you are on something. But I am not sure I understand what you mean.

"the blocks would be cleaner if those two parameters are separated"
Orgtb-aggregate already separates those two kind of parameters. In this example:

:cols "Day vmean(Level) vsum(Quantity)"
  • Day is of the first kind (aggregation),
  • vmean(Level) and vsum(Quantity) are of the second kind (operations on subgroups).

Do you mean that a different syntax would be clearer? Like for example:

:aggregation "Day" :operations "vmean(Level) vsum(Quantity)"

The current :cols syntax says something more. It gives what the resulting table should look like. So, the following two instructions give different results:

:cols "Day vmean(Level) vsum(Quantity)"
:cols "vmean(Level) vsum(Quantity) Day"

The resulting columns are ordered differently:

| Day     | vmean(Level) | vsum(Quantity) |
|---------+--------------+----------------|
| Monday  |         27.5 |             14 |
| Tuesday |           43 |             45 |
…

| vmean(Level) | vsum(Quantity) | Day     |
|--------------+----------------+---------|
|         27.5 |             14 | Monday  |
|           43 |             45 | Tuesday |
…

Ordering of resulting columns could not be specified with an hypothetical :aggregation and :operations syntax.

Maybe you could give some examples of what you are thinking about, both in Orgtb-Aggragate and SQL?

OK, clear.

yes, it would be the order of the operations. If you want day, add day to the operations:
,----
| :operation "day vmean(Level) vsum(Quantity)"
`----

OK. Currently, in Orgtbl-Aggregate, it is the other way around. If you do not want Day in the output, specify an invisibility formatting, as follow:

:cols "Day;<> …"

I guess the current model is such that any identifier without following left parenthesis is assumed to be the grouping attributes.

Yes, mostly. In the current model, a naked name (without parenthesis or operation) is assumed to be a grouping column.

and they have to the be the first in the list.

Not necessarily the first. A grouping column name may appear anywhere in the :cols specification.

select from vmean(level), vsum(quantity), day+10

Note that the resulting table does contain day +10 (not day), even though the grouping was done by this attribute.

You can do that in Orgtbl-Aggregate, although the result may not be exactly the same as in SQL:

:cols "Day;<> vmean(Level) vsum(Quantity) Day+10"

Here Day;<> is a naked name of an input column. Therefore, it acts as a grouping specification. In SQL it would appear in the group by clause. The invisibility formatting ;<> prevents it to be output.

Then Day+10 is not a naked column name. Therefore it acts as a formula to be applied on sub-groups. In SQL it would appear in the select clause.

The result in Orgtbl-Aggregate can be surprising. First because Day is not numeric. Second because it generates a list of identical values. Example:

[Tuesday + 10, Tuesday + 10, Tuesday + 10]

You may want to aggregate those values with an aggregator. For instance vmean(Day+10) (average value) or rdup(Day+10) (remove duplicates).

select from vmean(level), vsum(quantity), day+10
from table
group by day
order by level

and (at least in SQLlite, postgres would reject the query if level is not functionally dependent on day)

level would be actually a non-deterministic value. It simply chooses one tuple (usually the first) and extracts that value from it.

In Orgtbl-Aggregate, an ill-formed specification such as sorting on Level is not even possible. However sorting on the average of Level is possible, with a specification like this one:

:cols "… vmean(Level);^n …"

which stands for "ascending numerical sorting".

It might be possible to say, in addition to providing a list of org-mode table functions (vmean, etc). provide an elisp
function that takes the same number of parameters as the table, and returns a list of tuples (basically, a list of
tuples of the same size).

It might also be possible to have a function that takes one or more parameters and returns a value.

something like this:

,----
| :cols "vmean(Level) vsum(Quantity) somecomplexfunction(Level, Quantity)"
`----

or perhaps simply a elisp function that takes each row and generates a list of rows:

,----
| :cols elisp-function
`----

This can be achieved currently with the :post specification (not the :cols one). Excerpt from the documentation:


The aggregated table can be post-processed with the :post
parameter. It accepts a Lisp lambda, a Lisp function, or a Babel
block.

The process receives the aggregated table as parameter in the form of
a Lisp expression. It can process it in any way it wants, provided it
returns a valid Lisp table.

A Lisp table is a list of rows. Each row is either a list of cells, or
the special symbol hline.


To summarize, you would like Orgtbl-Aggregate to be more SQLish.

On one hand, this makes sense, as there are a lot of people who know SQL.

On the other hand, there is an installed user base. I cannot break compatibility. Moreover, duplicating SQL features would be an endless race. For example, SQL can act on several input tables, whereas Orgtbl-Aggregate uses a single input table.

If SQL advanced features are required, Org-Mode already provides Babel blocks with several flavors of back-end databases (SQLite, PostgreSQL, MySQL, and so on). Of course, this adds a dependency on an external tool, whereas Orgtbl-Aggregate is a pure Emacs package with absolutely no dependency.

Another path would be to make Orgtbl-Aggregate more R-ish (R is the open-source statistical language). Here too, this would be an endless race. And it would diverge from the SQLish path.

Orgtbl-Aggregate being an open-source project, you may want to fork it and extend it along the lines you described. Do not forget that this involves:

  • Elisp coding,
  • Documenting the new features without confusing the reader (the current documentation is already 1700 lines long),
  • Writing unit tests (there are currently 1500 lines of unit tests),
  • Maintaining it over years or decades.

Maybe an easier project would be to develop a small compiler to translate SQL to Orgtbl-Aggregate?

Thank you for your explanation. From what you describe it seems to be sufficient for most needs. I'll close the issue as resolved.