Suggestion: separate group by parameters from the projection of the grouping operations
dmgerman opened this issue · 4 comments
This module is great. I have a suggestion that might make it better.
Aggregation is an operation with two parameters: the columns to aggregate by, and the operations on each of the subgroups. I think the syntax of the blocks would be cleaner if those two parameters are separated:
- one parameter for the aggregation column (or columns, perhaps a future option)
- one parameter for the actual projected values for each of the subgroups (which might include the aggregation column, or not, or perhaps not in the first location.
Just a thought. This is how SQL implements the operation: subgroup by group by attributes, and then project based on the SELECT statement the results for each subgroup)
Thanks for your interest!
I feel that you are on something. But I am not sure I understand what you mean.
"the blocks would be cleaner if those two parameters are separated"
Orgtb-aggregate
already separates those two kind of parameters. In this example:
:cols "Day vmean(Level) vsum(Quantity)"
Day
is of the first kind (aggregation),vmean(Level)
andvsum(Quantity)
are of the second kind (operations on subgroups).
Do you mean that a different syntax would be clearer? Like for example:
:aggregation "Day" :operations "vmean(Level) vsum(Quantity)"
The current :cols
syntax says something more. It gives what the resulting table should look like. So, the following two instructions give different results:
:cols "Day vmean(Level) vsum(Quantity)"
:cols "vmean(Level) vsum(Quantity) Day"
The resulting columns are ordered differently:
| Day | vmean(Level) | vsum(Quantity) |
|---------+--------------+----------------|
| Monday | 27.5 | 14 |
| Tuesday | 43 | 45 |
…
| vmean(Level) | vsum(Quantity) | Day |
|--------------+----------------+---------|
| 27.5 | 14 | Monday |
| 43 | 45 | Tuesday |
…
Ordering of resulting columns could not be specified with an hypothetical :aggregation
and :operations
syntax.
Maybe you could give some examples of what you are thinking about, both in Orgtb-Aggragate and SQL?
OK, clear.
yes, it would be the order of the operations. If you want day, add day to the operations:
,----
| :operation "day vmean(Level) vsum(Quantity)"
`----
OK. Currently, in Orgtbl-Aggregate
, it is the other way around. If you do not want Day
in the output, specify an invisibility formatting, as follow:
:cols "Day;<> …"
I guess the current model is such that any identifier without following left parenthesis is assumed to be the grouping attributes.
Yes, mostly. In the current model, a naked name (without parenthesis or operation) is assumed to be a grouping column.
and they have to the be the first in the list.
Not necessarily the first. A grouping column name may appear anywhere in the :cols
specification.
select from vmean(level), vsum(quantity), day+10
Note that the resulting table does contain day +10 (not day), even though the grouping was done by this attribute.
You can do that in Orgtbl-Aggregate
, although the result may not be exactly the same as in SQL
:
:cols "Day;<> vmean(Level) vsum(Quantity) Day+10"
Here Day;<>
is a naked name of an input column. Therefore, it acts as a grouping specification. In SQL
it would appear in the group by
clause. The invisibility formatting ;<>
prevents it to be output.
Then Day+10
is not a naked column name. Therefore it acts as a formula to be applied on sub-groups. In SQL
it would appear in the select
clause.
The result in Orgtbl-Aggregate
can be surprising. First because Day
is not numeric. Second because it generates a list of identical values. Example:
[Tuesday + 10, Tuesday + 10, Tuesday + 10]
You may want to aggregate those values with an aggregator. For instance vmean(Day+10)
(average value) or rdup(Day+10)
(remove duplicates).
select from vmean(level), vsum(quantity), day+10
from table
group by day
order by leveland (at least in SQLlite, postgres would reject the query if level is not functionally dependent on day)
level would be actually a non-deterministic value. It simply chooses one tuple (usually the first) and extracts that value from it.
In Orgtbl-Aggregate
, an ill-formed specification such as sorting on Level
is not even possible. However sorting on the average of Level
is possible, with a specification like this one:
:cols "… vmean(Level);^n …"
which stands for "ascending numerical sorting".
It might be possible to say, in addition to providing a list of org-mode table functions (vmean, etc). provide an elisp
function that takes the same number of parameters as the table, and returns a list of tuples (basically, a list of
tuples of the same size).It might also be possible to have a function that takes one or more parameters and returns a value.
something like this:
,----
| :cols "vmean(Level) vsum(Quantity) somecomplexfunction(Level, Quantity)"
`----or perhaps simply a elisp function that takes each row and generates a list of rows:
,----
| :cols elisp-function
`----
This can be achieved currently with the :post
specification (not the :cols
one). Excerpt from the documentation:
The aggregated table can be post-processed with the :post
parameter. It accepts a Lisp lambda
, a Lisp function, or a Babel
block.
The process receives the aggregated table as parameter in the form of
a Lisp expression. It can process it in any way it wants, provided it
returns a valid Lisp table.
A Lisp table is a list of rows. Each row is either a list of cells, or
the special symbol hline
.
To summarize, you would like Orgtbl-Aggregate
to be more SQLish.
On one hand, this makes sense, as there are a lot of people who know SQL
.
On the other hand, there is an installed user base. I cannot break compatibility. Moreover, duplicating SQL
features would be an endless race. For example, SQL
can act on several input tables, whereas Orgtbl-Aggregate
uses a single input table.
If SQL
advanced features are required, Org-Mode
already provides Babel blocks with several flavors of back-end databases (SQLite
, PostgreSQL
, MySQL
, and so on). Of course, this adds a dependency on an external tool, whereas Orgtbl-Aggregate
is a pure Emacs
package with absolutely no dependency.
Another path would be to make Orgtbl-Aggregate
more R-ish (R
is the open-source statistical language). Here too, this would be an endless race. And it would diverge from the SQLish path.
Orgtbl-Aggregate
being an open-source project, you may want to fork it and extend it along the lines you described. Do not forget that this involves:
Elisp
coding,- Documenting the new features without confusing the reader (the current documentation is already 1700 lines long),
- Writing unit tests (there are currently 1500 lines of unit tests),
- Maintaining it over years or decades.
Maybe an easier project would be to develop a small compiler to translate SQL
to Orgtbl-Aggregate
?
Thank you for your explanation. From what you describe it seems to be sufficient for most needs. I'll close the issue as resolved.