(c) Thomas J. Leeper (2018), licensed CC-BY
The Grammar of Graphics introduced data analysts to an abstract and generalized "grammar" to translate raw data into data visualizations. ggplot2 implemented those ideas in R. dplyr provides a similar grammar for data transformations. These packages have showcased the value of an abstract grammar as an approach to translating a data object into some kind of output and, in particular, have highlighted the value of working with "tidy" data frames as a fundamental unit of data analysis. But lacking from the tidy verse is a grammar of tables, the common, rectangular data structure that summarizes a dataset into cells representing subsets of the data, arranged as rows, columns, and facets. This document outlines a grammar of tables and provides a provisional R implementation.
ggplot2 is successful in part because it separates the content of visualization being created ("geoms" and "aesthetics") and the styling of that visualization ("themes"). A grammar of tables would do the same for tabular presentations of data and it is something that has already been discussed in various places. This would represent a significant advance over existing approaches to tabular construction which closely couple content and styling (e.g., xtable()
) and styling/rendering tools (e.g., stargazer()
) which must be customized to work with every possible input format. R's tabular functions are also needlessly complex, such as the common expression prop.table(table(x))
which summarizes a table rather than conveniently summarizing the original data.
Other statistical software also merges content and styling inflexibly, such as the tab var1, summarize(var2)
and tabstat var2, s(min max mean median) by(var1)
approaches in Stata or the tabular summarization tools in SPSS that provide limited access to the abstract table content independent of the output tables.
Creating a single, abstract table representation makes it possible to render any table without regard for the form of the original data structure from which the table's content is generated. Further separating themes and format-specific rendering should simplify cross-format consistency of representation.
A grammar of tables further allows a computational approach to generating any arbitrary table by specifying how cells should be arranged according to what subsets of the original data they represent, and how those subsets should be set as rows, columns, and/or facets alongside table metadata. This ability to permute table arrangements is one of the reasons that "pivot tables" are seen as useful and that interactive approach to table design might be used to create a table arrangement that can later be used computationally.
A table is an arrangement of summaries of a dataset into rows, columns, and facets (or subtables). A table contains the following components:
-
A cell or set of cells, each of which represents a (sub)set of the data specified by one or more grouping factor(s) analyzed by the summarizing function. A summarizing function (or summarizer) is any function that evaluates to a single output (e.g., a numerical value or character string) for any vector of input values. The simplest summarizer is the identity function; more common examples would be the mean or variance function, or a function that returns a character string combining the two (e.g.,
mean (variance)
). Contingency tables typically utilize only one summarizer (the count or length function), but the general class of tables include tables with multiple summarizers (e.g., contingency tables sometimes contain "marginals"; a regression output table that reports coefficient estimates plus summary statistics for the model). -
An arrangement, which describes the positions of cells within the table into rows, columns, and facets (or subtables).
-
A set of metadata that are not cells or necessarily summaries of data, such as a title, subtitle, notes, annotations, hidden identifiers (e.g., HTML or LaTeX IDs), etc.
-
A theme, which describes the overall visual aesthetics of the table.
-
A rendering, which translates an abstract table representation into a particular output file format, such as markdown, LaTeX, HTML, docx, rtf, Excel, etc.
Each of the components 2-5 (arrangement, metadata, theme, and rendering) can be combined in multiple ways to express the informational content of the table (i.e., its cells) for some particular output. This means that a table takes two forms: (1) its abstract form containing only the table content and (2) its finalized form containing the content expressed through an arrangement, theme, and rendering (that may display metadata).
The contents of a cell are determined by a summarizer (a function, often a summary statistic) applied to each subset of the data defined by the grouping factors that compose the rows, columns, and facets. An implementation of the grammar of tables does not necessarily need to implement summarizers because the necessarily aggregations can be performed using existing functionality (e.g., aggregate()
, dplyr syntax, etc.).
That said, implementing some summarizers within a grammar of tables implementation may be useful, as it enables the combination of cells that reflect different summarizers within a single table. For example, a contingency table may contain subgroup counts as well as "row totals" and/or "column totals" that summarize the data at a higher level of aggregation. Combining these cell types requires knowing how the various cells summarize the original data in order to be able to arrange them systematically.
The separation of summarization from arrangement can be highlighted in a simple example. This tabular representation of the Titanic
survivor data:
ftable(Titanic, row.vars = 1:3)
## Survived No Yes
## Class Sex Age
## 1st Male Child 0 5
## Adult 118 57
## Female Child 0 1
## Adult 4 140
## 2nd Male Child 0 11
## Adult 154 14
## Female Child 0 13
## Adult 13 80
## 3rd Male Child 35 13
## Adult 387 75
## Female Child 17 14
## Adult 89 76
## Crew Male Child 0 0
## Adult 670 192
## Female Child 0 0
## Adult 3 20
is simply a particular representation of a sum()
aggregation of the data by Class
, Sex
, Age
, and Survived
:
aggregate(Freq ~ ., data = as.data.frame(Titanic), FUN = sum)
## Class Sex Age Survived Freq
## 1 1st Male Child No 0
## 2 2nd Male Child No 0
## 3 3rd Male Child No 35
## 4 Crew Male Child No 0
## 5 1st Female Child No 0
## 6 2nd Female Child No 0
## 7 3rd Female Child No 17
## 8 Crew Female Child No 0
## 9 1st Male Adult No 118
## 10 2nd Male Adult No 154
## 11 3rd Male Adult No 387
## 12 Crew Male Adult No 670
## 13 1st Female Adult No 4
## 14 2nd Female Adult No 13
## 15 3rd Female Adult No 89
## 16 Crew Female Adult No 3
## 17 1st Male Child Yes 5
## 18 2nd Male Child Yes 11
## 19 3rd Male Child Yes 13
## 20 Crew Male Child Yes 0
## 21 1st Female Child Yes 1
## 22 2nd Female Child Yes 13
## 23 3rd Female Child Yes 14
## 24 Crew Female Child Yes 0
## 25 1st Male Adult Yes 57
## 26 2nd Male Adult Yes 14
## 27 3rd Male Adult Yes 75
## 28 Crew Male Adult Yes 192
## 29 1st Female Adult Yes 140
## 30 2nd Female Adult Yes 80
## 31 3rd Female Adult Yes 76
## 32 Crew Female Adult Yes 20
Other tables with the same arrangement as the first might be created from an alternative summarizer, such as taking the mean age of subset, proportion of women, etc.
A good example of many common summarizers designed in a tidy framework can be found in the tabyl()
functionality of the janitor package.
Consider the following abstract table that contains four cells, which summarize a dataset by two grouping factors (V1
and V2
):
V1 | V2 | Summarizer | Value |
---|---|---|---|
A | C | Count | 1 |
A | D | Count | 2 |
B | C | Count | 3 |
B | D | Count | 4 |
This example highlights what distinguishes a table from a general dataset: it contains only one "value" column. This column contains the cells of the table and other columns contain the group factors or other information (e.g., what summarizing function generated that value).
The table's content can be transformed in any of the following arrangements (plus numerous others) which hold the content (and theme and rendering) constant:
| B | C | 3 |
| B | D | 4 |
| A | C | 1 |
| A | D | 2 |
| A | D | 2 |
| A | C | 1 |
| B | D | 4 |
| B | C | 3 |
| B | D | 4 |
| B | C | 3 |
| A | D | 2 |
| A | C | 1 |
| A | C | 1 |
| B | C | 3 |
| A | D | 2 |
| B | D | 4 |
| A | D | 2 |
| B | D | 4 |
| A | C | 1 |
| B | C | 3 |
| A | A | B | B |
| C | D | C | D |
| 1 | 2 | 3 | 4 |
| B | B | A | A |
| C | D | C | D |
| 3 | 4 | 1 | 2 |
| A | B | A | B |
| C | C | D | D |
| 1 | 3 | 2 | 4 |
| A | B | A | B |
| D | D | C | C |
| 2 | 4 | 1 | 3 |
| | C | D |
| A | 1 | 2 |
| B | 3 | 4 |
| | C | D |
| B | 3 | 4 |
| A | 1 | 2 |
| | D | C |
| A | 2 | 1 |
| B | 4 | 3 |
| | D | C |
| B | 4 | 3 |
| A | 2 | 1 |
The theme contains the general aesthetics of the final output table. Ideally, themes are independent of the renderer, such that any theme can be rendered similarly in any markup language. The simplest theme is the empty theme, which contains no aesthetic information, allowing users to express a theme in the output format (e.g., CSS, etc.).
Table themes are complex and something to be thought about later on. A useful reference may be: https://medium.com/mission-log/design-better-data-tables-430a30a00d8c#.cprfjagxl
The renderer of the table convey the cell content, arrangement, metadata, and theme in a markup language. Commonly needed renderers for R would be:
- markdown
- latex
- html
- OpenDocument
- docx
- rtf
- Excel
- others?
Implementing a renderer to work with this grammatical approach to tables will require a set of renderers that respect a general grammar of arrangement and the format-independent expression of a theme. Existing rendering functions - xtable, kable, rtf, stargazer, DT, etc. etc. - are probably not going to be adequate to work with all formats.
# - renderers
# - xtable
# - flextable
# - rtf
# - knitr::kable()
# - formatttable
# - knitLatex
# - htmlTable
# - psytabs
# - SortableHTMLTables
# - tablaxlsx
# - table1xls
# - tableHTML
# - TableMonster
# - texreg
# - ztable
# - apaStyle
# - apaTables
# - apsrtable
# - DT
# - higher-level functionality
# - janitor
# - huxtable
# - tables
# - stargazer
# - pixiedust
# - reporttools
# - rtable
# - summarytools
# - tab
# - tableone
# - carpenter
# - dtables
# - etable
# - tangram
# - pivots
# - rpivotttable
# - other
# - gtable
The huxtable website provides a useful breakdown of many of these existing packages.