Suggestion: separate group by parameters from the projection of the grouping operations

Question

Suggestion: separate group by parameters from the projection of the grouping operations

dmgerman opened this issue 2 years ago · 4 comments

This module is great. I have a suggestion that might make it better.

Aggregation is an operation with two parameters: the columns to aggregate by, and the operations on each of the subgroups. I think the syntax of the blocks would be cleaner if those two parameters are separated:

one parameter for the aggregation column (or columns, perhaps a future option)
one parameter for the actual projected values for each of the subgroups (which might include the aggregation column, or not, or perhaps not in the first location.

Just a thought. This is how SQL implements the operation: subgroup by group by attributes, and then project based on the SELECT statement the results for each subgroup)

Answer 1 · 2023-04-01T07:13:51.000Z

Thanks for your interest!

I feel that you are on something. But I am not sure I understand what you mean.

"the blocks would be cleaner if those two parameters are separated"
Orgtb-aggregate already separates those two kind of parameters. In this example:

:cols "Day vmean(Level) vsum(Quantity)"

Day is of the first kind (aggregation),
vmean(Level) and vsum(Quantity) are of the second kind (operations on subgroups).

Do you mean that a different syntax would be clearer? Like for example:

:aggregation "Day" :operations "vmean(Level) vsum(Quantity)"

The current :cols syntax says something more. It gives what the resulting table should look like. So, the following two instructions give different results:

:cols "Day vmean(Level) vsum(Quantity)"
:cols "vmean(Level) vsum(Quantity) Day"

The resulting columns are ordered differently:

| Day     | vmean(Level) | vsum(Quantity) |
|---------+--------------+----------------|
| Monday  |         27.5 |             14 |
| Tuesday |           43 |             45 |
…

| vmean(Level) | vsum(Quantity) | Day     |
|--------------+----------------+---------|
|         27.5 |             14 | Monday  |
|           43 |             45 | Tuesday |
…

Ordering of resulting columns could not be specified with an hypothetical :aggregation and :operations syntax.

Maybe you could give some examples of what you are thinking about, both in Orgtb-Aggragate and SQL?

Answer 2 · 2023-04-01T22:11:55.000Z

Hi Tbanel, tbanel ***@***.***> writes:

Thanks for your interest! I feel that you are on something. But I am not sure I understand what you mean. "the blocks would be cleaner if those two parameters are separated" Orgtb-aggregate already separates those two kind of parameters. In this example: :cols "Day vmean(Level) vsum(Quantity)" * Day is of the first kind (aggregation), * vmean(Level) and vsum(Quantity) are of the second kind (operations on subgroups). Do you mean that a different syntax would be clearer? Like for example: :aggregation "Day" :operations "vmean(Level) vsum(Quantity)"

yes, exactly

The current :cols syntax says something more. It gives what the resulting table should look like. So, the following two instructions give different results: :cols "Day vmean(Level) vsum(Quantity)" :cols "vmean(Level) vsum(Quantity) Day" The resulting columns are ordered differently: | Day | vmean(Level) | vsum(Quantity) | |---------+--------------+----------------| | Monday | 27.5 | 14 | | Tuesday | 43 | 45 | … | vmean(Level) | vsum(Quantity) | Day | |--------------+----------------+---------| | 27.5 | 14 | Monday | | 43 | 45 | Tuesday | … Ordering of resulting columns could not be specified with an hypothetical :aggregation and :operations syntax.

yes, it would be the order of the operations. If you want day, add day to the operations: ,---- | :operation "day vmean(Level) vsum(Quantity)" `----

Maybe you could give some examples of what you are thinking about, both in Orgtb-Aggragate and SQL?

I guess the current model is such that any identifier without following left parenthesis is assumed to be the grouping attributes. and they have to the be the first in the list. But in SQL, this information is kept separate. Let us assume a query to a single table in SQL: ,---- | select <projection> | from <table> | where <condition> | group by <groupingAttributes> | order by <ordering> `---- The execution of this query is as follows: 1. Do the where clause to remove non-matching tuples 2. Do the aggregation by the aggregation attributes 3. Do the projection on each of the groups created in 2 4. Order the results according to the <ordering> This allows for queries such as: dependent on day) ,---- | select from vmean(level), vsum(quantity), day+10 | from <table> | group by day | order by level `---- and (at least in SQLlite, postgres would reject the query if level is not functionally dependent on day) ,---- | select from day, level, vmean(level), vsum(quantity), day+10 | from <table> | group by day | order by level `---- level would be actually a non-deterministic value. It simply chooses one tuple (usually the first) and extracts that value from it. Note that the resulting table does contain day +10 (not day), even though the grouping was done by this attribute. I think there is another advantage to separate the grouping attributes. It might be possible to say, in addition to providing a list of org-mode table functions (vmean, etc). provide an elisp function that takes the same number of parameters as the table, and returns a list of tuples (basically, a list of tuples of the same size). It might also be possible to have a function that takes one or more parameters and returns a value. something like this: ,---- | :cols "vmean(Level) vsum(Quantity) somecomplexfunction(Level, Quantity)" `---- or perhaps simply a elisp function that takes each row and generates a list of rows: ,---- | :cols elisp-function `---- hopefully this clarifies it. ,---- | -- | Daniel M. German "Architecture will always express the technical and social progress of the country in which it is carried out." | Oscar Niemeyer | http://turingmachine.org/ | http://silvernegative.com/ | dmg (at) uvic (dot) ca | replace (at) with @ and (dot) with . | `----

Answer 3 · 2023-04-02T08:49:27.000Z

OK, clear.

yes, it would be the order of the operations. If you want day, add day to the operations:
,----
| :operation "day vmean(Level) vsum(Quantity)"
`----

OK. Currently, in Orgtbl-Aggregate, it is the other way around. If you do not want Day in the output, specify an invisibility formatting, as follow:

:cols "Day;<> …"

I guess the current model is such that any identifier without following left parenthesis is assumed to be the grouping attributes.

Yes, mostly. In the current model, a naked name (without parenthesis or operation) is assumed to be a grouping column.

and they have to the be the first in the list.

Not necessarily the first. A grouping column name may appear anywhere in the :cols specification.

select from vmean(level), vsum(quantity), day+10

Note that the resulting table does contain day +10 (not day), even though the grouping was done by this attribute.

You can do that in Orgtbl-Aggregate, although the result may not be exactly the same as in SQL:

:cols "Day;<> vmean(Level) vsum(Quantity) Day+10"

Here Day;<> is a naked name of an input column. Therefore, it acts as a grouping specification. In SQL it would appear in the group by clause. The invisibility formatting ;<> prevents it to be output.

Then Day+10 is not a naked column name. Therefore it acts as a formula to be applied on sub-groups. In SQL it would appear in the select clause.

The result in Orgtbl-Aggregate can be surprising. First because Day is not numeric. Second because it generates a list of identical values. Example:

[Tuesday + 10, Tuesday + 10, Tuesday + 10]

You may want to aggregate those values with an aggregator. For instance vmean(Day+10) (average value) or rdup(Day+10) (remove duplicates).

select from vmean(level), vsum(quantity), day+10
from table
group by day
order by level

and (at least in SQLlite, postgres would reject the query if level is not functionally dependent on day)

level would be actually a non-deterministic value. It simply chooses one tuple (usually the first) and extracts that value from it.

In Orgtbl-Aggregate, an ill-formed specification such as sorting on Level is not even possible. However sorting on the average of Level is possible, with a specification like this one:

:cols "… vmean(Level);^n …"

which stands for "ascending numerical sorting".

It might be possible to say, in addition to providing a list of org-mode table functions (vmean, etc). provide an elisp
function that takes the same number of parameters as the table, and returns a list of tuples (basically, a list of
tuples of the same size).

It might also be possible to have a function that takes one or more parameters and returns a value.

something like this:

,----
| :cols "vmean(Level) vsum(Quantity) somecomplexfunction(Level, Quantity)"
`----

or perhaps simply a elisp function that takes each row and generates a list of rows:

,----
| :cols elisp-function
`----

This can be achieved currently with the :post specification (not the :cols one). Excerpt from the documentation:

The aggregated table can be post-processed with the :post
parameter. It accepts a Lisp lambda, a Lisp function, or a Babel
block.

The process receives the aggregated table as parameter in the form of
a Lisp expression. It can process it in any way it wants, provided it
returns a valid Lisp table.

A Lisp table is a list of rows. Each row is either a list of cells, or
the special symbol hline.

To summarize, you would like Orgtbl-Aggregate to be more SQLish.

On one hand, this makes sense, as there are a lot of people who know SQL.

On the other hand, there is an installed user base. I cannot break compatibility. Moreover, duplicating SQL features would be an endless race. For example, SQL can act on several input tables, whereas Orgtbl-Aggregate uses a single input table.

If SQL advanced features are required, Org-Mode already provides Babel blocks with several flavors of back-end databases (SQLite, PostgreSQL, MySQL, and so on). Of course, this adds a dependency on an external tool, whereas Orgtbl-Aggregate is a pure Emacs package with absolutely no dependency.

Another path would be to make Orgtbl-Aggregate more R-ish (R is the open-source statistical language). Here too, this would be an endless race. And it would diverge from the SQLish path.

Orgtbl-Aggregate being an open-source project, you may want to fork it and extend it along the lines you described. Do not forget that this involves:

Elisp coding,
Documenting the new features without confusing the reader (the current documentation is already 1700 lines long),
Writing unit tests (there are currently 1500 lines of unit tests),
Maintaining it over years or decades.

Maybe an easier project would be to develop a small compiler to translate SQL to Orgtbl-Aggregate?

Answer 4 · 2023-04-25T06:10:30.000Z

Thank you for your explanation. From what you describe it seems to be sufficient for most needs. I'll close the issue as resolved.