jgm/pandoc

Support for table column spans, table attributes in AST

dashed opened this issue · 107 comments

I tried looking for this in Pandoc's docs. None of the flavours of markdown tables support column spanning. I don't think there are known markdown flavours that support column spanning except for multimarkdown.

Are there plans to support this?

jgm commented

+++ Alberto Leal [Oct 16 13 17:35 ]:

I tried looking for this in Pandoc's docs. None of the flavours of markdown tables support column spanning. I don't think there are known markdown flavours that support column spanning except for multimarkdown.

Are there plans to support this?

Long-term, yes, I'd like to.

Are there any plans for this? I'm also interested in this.

jgm commented

+++ jokogr [Oct 22 15 08:00 ]:

Are there any plans for this? I'm also interested in this.

Yes, it would be good to do, but it's a big change as it
requires changes in the underlying document model.

Is there anything I could do to speed this up?

adius commented

+1

ousia commented

Are there any plans for this? I'm also interested in this.

@jokogr, sorry for my obvious reply: providing a patch may help.

jgm commented

+++ Pablo Rodríguez [Mar 09 16 22:03 ]:

Are there any plans for this? I'm also interested in this.

@jokogr, sorry for my obvious reply: providing a patch may help.

This would require some major architecture change,
including changes in pandoc-types, all readers and writers.

@jgm is there any way to achieve this with a two-step process? Compile multimarkdown to an intermediate state and then that result with pandoc?

jgm commented

+++ Brian Feister [Mar 10 16 08:50 ]:

@jgm is there any way to achieve this with a two-step process?
Compile multimarkdown to an intermediate state and then that result
with pandoc?

No, the problem is very simple. Pandoc's internal document
model doesn't allow colspans or rowspans. It's on the list
of things to improve.

ickc commented

Side note: this issue should probably get the AST change label.

In an attempt to make a suggestion here, I stumbled on issue #3154: pandoc "almost" has a 5th table extension: using native HTML as table. If it were true:

then once the AST has been changed to allow colspan and rowspan, we could start using them immediately, before settling on a syntax (or syntaxes) for them. For example, in a Markdown to LaTeX conversion, pandoc could consume the colspan and rowspan and emit multicolumn and multirow.

The reason for this suggestion is that settling on a syntax is often tricky and requires a lot of discussion (except possibly for "mmd_colspan", since it already exists). If only the AST change were required (which is a prerequisite for any new syntax anyway), that would make things easier.

jgm commented

An AST change to tables would require changes in both
readers and writers. You're right that we would not
necessarily need to support a native Markdown table
syntax with rowspans and colspans right away. We
could just implement HTML tables. But we'd still need
to change ALL the writers to handle rowspans and colspans,
since these would be in the basic table model. That's
already somewhat daunting (I suppose Markdown could fall
back to HTML, but RST couldn't).

ickc commented

@jgm said,

But we'd still need
to change ALL the writers to handle rowspans and colspans,
since these would be in the basic table model.

I don't know much about the design of the AST. Can a new AST including column & row span be a superset of the current AST?

If so, the transition period can be made smoother, i.e. a gradual roll out of the feature rather than changing the AST and all the writers & readers at the same time:

  1. The AST can be changed first. Since it is a superset of the original, every reader/writer would still work.
  2. The column/row span extension can be activated by a feature flag on each writer and reader, documented in a matrix like this from OpenZFS.
  3. In general, writers have higher priority than readers for implementing the feature flag (except for the markdown reader), since every existing reader already generates a valid AST.
  4. The extension is activated only when both the from-format reader and the to-format writer have the feature flag.

One of the reasons this feature is important is that in a scientific setting, whenever one needs to compare multiple groups, some kind of subheader is needed in the table. Having that would give pandoc a firm footing in the space of scientific writing.

jgm commented

+++ lf_araujo [Oct 11 16 22:34 ]:

One of the reasons this feature is important is that in a scientific
setting, whenever one needs to compare multiple groups, some kind of
subheader is needed in the table. Having that would give pandoc a
firm footing in the space of scientific writing.

Actually there are two issues here, right?

  1. column spans
  2. the ability to have multiple rows in a header

jgm commented

No, changing the table AST would definitely require changes
to all writers and readers, immediately. The writers would
all have to know what to do when they encounter colspans.

ickc commented

I don't understand. Can we put a switch in the reader such that, when a pandoc command is run and the output writer is known not to understand colspans, the reader does not parse colspans? It seems some kind of switch is already being used, say when an extension is turned off on the command line.

I can see a problem might occur if the input format is AST or JSON, where the reader cannot (or at least can hardly) switch off the colspan/rowspan features. But people should know what they are doing if AST/JSON is used (and we could show them an error message).

Actually there are two issues here, right?

Yes. The ability to format subheaders would be also needed.

jgm commented

+++ ickc [Oct 12 16 01:15 ]:

I don't understand. Can we put a switch in the reader such that, when a
pandoc command is run and the output writer is known not to understand
colspans, the reader does not parse colspans? It seems some kind of
switch is already being used, say when an extension is turned off on
the command line.

No, readers are always independent of writers. But you're
not seeing the problem. Even if we had a switch that
depended on the output format, the readers would still need
to be rewritten because of changes in the Pandoc type
(specifically the Table constructor).

ickc commented

I see. Then that is really a huge task. And I guess that since changing the AST causes compatibility issues, one doesn't want to change it often, which means the strategy is probably to make several important AST changes at once, which makes it an even bigger challenge.

And since an AST change will break backward compatibility, is it safe to say it will only be in pandoc v2.0? In that case, should a milestone be set up (even with no deadline), with those important AST changes added to it (among other things)?

Thanks for the attention. I will leave two models of tables that are prevalent in papers. The first should be approachable in future iterations of pandoc, the second one, however, is a little more tricky and may not.

| Area                      |  Subjects       |      Controls   |
|---------------------------|-----------------|-----------------|
|                           |SD | se | p-value|SD | se | p-value|
|===========================|=================|=================|
| Standardised coefficients                                   |||
|===========================|=================|=================|
| Left fusiform area        | 1 | 2 | .05     | 3 | 4 |  .05    |
| Right insula              | 5 | 6 | .05     | 7 | 8 |  .05    |
| Left insula               | 5 | 6 | .05     | 7 | 8 |  .05    |
| Right fusiform area       | 1 | 2 | .05     | 3 | 4 |  .05    |
|===========================|=================|=================|
| Factor loadings                                             |||
|===========================|=================|=================|
| X                         | 1 | 2 | .05     | 3 | 4 |  .05    |
| Y                         | 5 | 6 | .05     | 7 | 8 |  .05    |
| Z                         | 5 | 6 | .05     | 7 | 8 |  .05    |

The equals signs were used to represent the places that usually have lines separating the subheaders.

The second, trickier table occurs when one wants a cell to span two rows vertically. This is not essential; I am including it as an example of a common type of table (a table describing the population, in this case).

| **Variables**                  |  **Healthy subjects (mean)** |   **Patients (mean)**   | **p-value**  |
|--------------------------------|------------------------------|-------------------------|--------------|
| **Age**                        |       11                     |        28.51            |    .01^(U)^  |
| **Gender**                                                                                          ||||                                                                     
| Male (%)                       |        12%                   |      99%                |              |
| Female (%)                     |        13%                   |         88%             |  .99^(a)^    |
| **Time from onset (days)**     |  NA                          |        111              |              |
| **Education (mean, in years)** |  10                          |     5                   |   .11 ^(U)^  |                                                      

What typically happens is that the cell containing the value .99 is merged with the cell above it, because that statistic concerns both Male and Female. I hope I am being clear.

jgm commented

Pandoc currently allows at most one header row, which must be at the top. A rule is inserted below it in default LaTeX output.

One could try to separate conceptually between being a header cell and having a rule under, so that a cell could have one of these properties without the other. Perhaps the idea in @lf-araujo's example is that a rule of hyphens --- divides the header from the rest, while a rule of equals === indicates a rule? But do the hyphens also cause a rule to be rendered?

jgm commented

Here's a proposal for an AST change:

type Rowspan = Int
type Colspan = Int
data Cell = Cell Rowspan Colspan [Block]
data Row = Row [Cell] | Header [[Cell]]
type Caption = [Block]
-- constructor for Table in Block:
   | Table Attr Caption [Alignment] [Maybe Double] [Row]

Improvements:

  1. Can have cells spanning columns
  2. Can have cells spanning rows
  3. Can have multi-row headers and secondary headers
  4. Tables can have attributes (e.g. id, class)
  5. Column widths are optional for each column
  6. Captions can have block-level content

Thoughts on this?

Of course, if we implemented this, one should not expect support for all features in all readers/writers. Some tables representable this way may not be renderable in some formats.
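To make the proposal concrete, here is a minimal sketch of how a two-row header with spanning cells might be encoded. `Block`, `Attr` and `Alignment` below are simplified stand-ins for the pandoc-types definitions, `Table` is modelled as its own type rather than as a constructor of `Block`, and `cell`, `rowWidth` and `exampleTable` are made up for illustration. Treat it as a picture of the proposed shape, not as working pandoc code.

```haskell
-- Simplified stand-ins for the real pandoc-types definitions (assumptions).
data Block = Plain String deriving (Show, Eq)
type Attr = (String, [String], [(String, String)])
data Alignment = AlignLeft | AlignRight | AlignCenter | AlignDefault
  deriving (Show, Eq)

-- The types as proposed above.
type Rowspan = Int
type Colspan = Int
data Cell = Cell Rowspan Colspan [Block] deriving (Show, Eq)
data Row = Row [Cell] | Header [[Cell]] deriving (Show, Eq)
type Caption = [Block]

-- Stand-in for the proposed Table constructor of Block.
data Table = Table Attr Caption [Alignment] [Maybe Double] [Row]
  deriving (Show, Eq)

-- Hypothetical helper: a 1x1 cell holding plain text.
cell :: String -> Cell
cell s = Cell 1 1 [Plain s]

-- A header whose first row has a cell spanning two rows ("Area") and a
-- cell spanning two columns ("Subjects"), followed by one body row.
exampleTable :: Table
exampleTable =
  Table ("results", ["wide"], []) [Plain "Group comparison"]
        [AlignLeft, AlignRight, AlignRight]
        [Nothing, Nothing, Nothing]
        [ Header [ [Cell 2 1 [Plain "Area"], Cell 1 2 [Plain "Subjects"]]
                 , [cell "SD", cell "se"] ]
        , Row [cell "Left insula", cell "5", cell "6"]
        ]

-- Number of columns a list of cells covers, counting colspans.
rowWidth :: [Cell] -> Int
rowWidth cs = sum [c | Cell _ c _ <- cs]
```

Note that the second header row lists only two cells: its first column is already taken by the rowspan of "Area", which is exactly the kind of bookkeeping writers would have to do.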

mb21 commented

Looks good on quick glance. Just crossed my mind that we could additionally add [Attr] [Attr] to Table in order to have attributes on the rows and columns as well (the alignments and widths could instead be considered attributes on the columns, not sure if this would clean up or complicate the readers and writers). Then again, this might go too far...

ickc commented

@jgm said:

Column widths are optional for each column

What does this mean? I think currently unspecified widths will be just [0.0, 0.0, ...]. Does it mean in the future it can be [0.0, 0.2, 0.0, 0.3, ...]? What would this mean?

Some tables representable this way may not be renderable in some formats.

How should output formats that don't support colspan and rowspan be handled? It would be nice to have some way to output a table without rowspan/colspan while still trying to preserve the structure. The simple way would be to put a bunch of empty cells in the remaining slots of the rowspan/colspan. But then maybe some sort of keyword convention could be used there to indicate to human readers that the cell is a continuation of the previous one. I'm particularly interested in this because the pantable CSV "reader & writer" is currently "lossless" with respect to AST to CSV and vice versa.
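The "fill the rest with empty cells" fallback can be sketched as follows. This is only an illustration under assumptions: cells are plain `(rowspan, colspan, text)` tuples rather than anything from pandoc-types, the names `flattenRow` and `flattenTable` are hypothetical, and the empty string stands in for whatever continuation marker might be agreed on.

```haskell
-- (rowspan, colspan, text); a stand-in for a real cell type (assumption).
type SpanCell = (Int, Int, String)

-- Lay out one row. The first argument holds, per column, how many more rows
-- are still covered by a rowspan that started in an earlier row.
flattenRow :: [Int] -> [SpanCell] -> ([Int], [String])
flattenRow [] _ = ([], [])
flattenRow (p : ps) cells
  | p > 0 =
      -- Column still covered from above: emit a filler cell.
      let (ps', out) = flattenRow ps cells
      in (p - 1 : ps', "" : out)
flattenRow ps [] =
  -- The row ran out of cells: pad the remaining columns.
  (map (\p -> max 0 (p - 1)) ps, map (const "") ps)
flattenRow ps ((r, c, t) : cells) =
  let (covered, rest) = splitAt (max 1 c) ps
      n = length covered
      (rest', out) = flattenRow rest cells
  in (replicate n (max 0 (r - 1)) ++ rest', t : replicate (n - 1) "" ++ out)

-- Turn rows with spans into a plain rectangular grid of the given width.
flattenTable :: Int -> [[SpanCell]] -> [[String]]
flattenTable ncols = go (replicate ncols 0)
  where
    go _ [] = []
    go pending (row : rows) =
      let (pending', out) = flattenRow pending row
      in out : go pending' rows
```

For a three-column table, a cell with colspan 2 yields its text followed by one filler cell, and a cell with rowspan 2 yields a filler cell in the same column of the next row. Going back from such a grid to the spans is not possible unless the filler is a distinguishable keyword, which is the concern raised above.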

Other questions are:

  1. it seems this proposal can allow header row after non-header row?

  2. Any plan to support the vertical header mentioned in #1359 (e.g. through transpose specified in attributes)?

  3. @mb21's suggestion of granting attributes to rows and columns might be useful too. Although it might seem like overkill for now, it would make the model "feature-complete", and I'm sure some people would find it useful (since people are already doing it in HTML). As for the markdown syntax for this, the standard pandoc attribute syntax in the cells of the first row/column would do.

I can't help with the AST, but I have been generating tables with multimarkdown to latex and importing them into my mds and later processing with pandoc.

There are two problems, one is the row span for which I don't think there is an easy solution. The second problem is the column span for which I suggested a layout previously. This design can be further simplified to:

| Area                      |  Subjects       |      Controls   |
|                           |SD | se | p-value|SD | se | p-value|
|---------------------------|-----------------|-----------------|
| Standardised coefficients                                   |||
|---------------------------|-----------------|-----------------|
| Left fusiform area        | 1 | 2 | .05     | 3 | 4 |  .05    |
| Right insula              | 5 | 6 | .05     | 7 | 8 |  .05    |
| Left insula               | 5 | 6 | .05     | 7 | 8 |  .05    |
| Right fusiform area       | 1 | 2 | .05     | 3 | 4 |  .05    |
|---------------------------|-----------------|-----------------|
| Factor loadings                                             |||
|---------------------------|-----------------|-----------------|
| X                         | 1 | 2 | .05     | 3 | 4 |  .05    |
| Y                         | 5 | 6 | .05     | 7 | 8 |  .05    |
| Z                         | 5 | 6 | .05     | 7 | 8 |  .05    |

So there is no need for equals signs; hyphens can do the trick instead. The first line of hyphens to appear should represent the end of the first header. Each following line of hyphens should represent the beginning or the end of a heading for a section within the table (or merely a marker for printing \midrule in LaTeX).

As for the vertical cell span, which is also very important in publication, unless someone comes up with a readable way of representing it in plain text, it probably should be left out of the changes for now.

ickc commented

@lf-araujo, I think you're talking about pipe tables. In grid tables I think rowspan and colspan will come naturally. Currently, although pandoc has 4 table syntaxes, not all of them support all features supported by the AST. In fact, only the grid table syntax supports everything the AST is capable of. And this can reasonably be expected to remain true once rowspan/colspan is implemented.

Personally, I have no easy way to write grid tables (when I need all features supported by pandoc's AST), since I don't use Emacs. That's why I wrote a filter, pantable, to do something similar but in CSV instead. But it will be challenging to support colspan/rowspan in CSV with pantable. That's why I have a question about this above.

mb21 commented
  1. Any plan to support the vertical header mentioned in #1359 (e.g. through transpose specified in attributes)?

No, I don't see a clean way to do that.

It wouldn't be a semantically clean way, but with attributes for the columns we could at least make it bold or add a class to it.

jgm [27 Feb]:

I actually do have some ideas there, but the AST can change
even if we don't have a way of representing in plain text.
After all, one might use pandoc to convert from HTML to
DocBook, for example, and both formats have easy ways of
representing rowspans. Pandoc can revert to raw HTML when
rendering a table with rowspans in Markdown (as it does
now).

I agree. While I see the advantage to reliably providing like-for-like conversions, pragmatically I think it's seriously worth considering fallbacks.

Rowspan / colspan table attributes are a necessary part of academic / research paper formats; being able to use Pandoc to convert these would have a positive effect.

I'd be very keen to see this feature available ASAP.

jgm commented

Here is a plan of action that will allow us to integrate the new table features little by little, without doing everything in one massive push:

  • Decide on new AST type.
  • Make changes to pandoc-types (including changes in Builder, JSON, etc.). But, for now, keep the signature of Builder.table the same.
  • Modify pandoc.cabal and stack.yaml so pandoc builds against the new pandoc-types commit. Build will fail.
  • Add a temporary function that destructures a Table into the same fields we currently have in Table (with appropriate fallbacks). Use this to quickly convert the current writers to work with the new Table type. Readers should already work, because Builder.table has the same signature. A few changes in auxiliary functions may be needed. At this point, pandoc should compile against the new pandoc-types, but it will have no new table features. The idea is that we'll add these gradually.
  • Modify the signature of Builder.table so it can construct new-style tables with all the features. Modify the readers to use the new Builder.table, but not yet to include any new table features. Pandoc should again compile.
  • At this point we can work on individual readers and writers, converting them to use the new table features.
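The temporary destructuring function in the fourth step might look roughly like the sketch below. Everything here is an assumption made for illustration: the types are simplified stand-ins for pandoc-types, `NewTable` follows the table shape proposed in this thread (which is not settled), and `toLegacy` and `legacyCells` are hypothetical names. The fallbacks chosen (missing width becomes 0, a colspan becomes empty filler cells, rowspans are dropped, only a leading header row is kept as the header) are just one possible set.

```haskell
-- Simplified stand-ins for pandoc-types (assumptions, not the real definitions).
type Inline = String
data Block = Plain [Inline] deriving (Show, Eq)
data Alignment = AlignLeft | AlignRight | AlignCenter | AlignDefault
  deriving (Show, Eq)
type Attr = (String, [String], [(String, String)])

-- New-style table, following the proposal in this thread.
data Cell = Cell Int Int [Block] deriving (Show, Eq)  -- rowspan, colspan, contents
data Row = Row [Cell] | Header [[Cell]] deriving (Show, Eq)
data NewTable = NewTable Attr [Block] [(Alignment, Maybe Double)] [Row]
  deriving (Show, Eq)

-- Old-style fields: caption, alignments, widths, header row, body rows.
type TableCell = [Block]
type LegacyTable = ([Inline], [Alignment], [Double], [TableCell], [[TableCell]])

-- A colspan of n becomes the cell followed by n-1 empty cells; rowspans are dropped.
legacyCells :: [Cell] -> [TableCell]
legacyCells = concatMap (\(Cell _ c bs) -> bs : replicate (c - 1) [])

toLegacy :: NewTable -> LegacyTable
toLegacy (NewTable _ capt specs rows) =
  ( concat [ils | Plain ils <- capt]   -- block caption flattened to inlines
  , map fst specs
  , map (maybe 0 id . snd) specs       -- unspecified width falls back to 0
  , headerRow
  , bodyRows )
  where
    flat = concatMap rowsOf rows
    rowsOf (Row cs)     = [(False, legacyCells cs)]
    rowsOf (Header css) = [(True, legacyCells cs) | cs <- css]
    -- Only a leading header row survives; later header rows become body rows.
    (headerRow, bodyRows) = case flat of
      (True, h) : rest -> (h, map snd rest)
      _                -> ([], map snd flat)
```

A writer that has not been converted yet could call such a function first and then proceed exactly as it does today, which is what lets the migration happen one writer at a time.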

jgm commented

We should try to ensure that there's only one way to represent a given table in the data type, and that bad tables are impossible to represent. See #3648.

jgm commented

The following types were proposed earlier:

type Rowspan = Int
type Colspan = Int
data Cell = Cell Rowspan Colspan [Block]
data Row = Row [Cell] | Header [[Cell]]
type Caption = [Block]
-- constructor for Table in Block:
   | Table Attr Caption [Alignment] [Maybe Double] [Row]

Ideally we could get more guarantees into the types, though I'm not sure how. This representation

  • does not require that the number of alignment specifiers = the number of width specifiers, or that either = the number of columns in a row.
  • does not require that rows all have the same number of columns (taking into account colspans)
  • does not require that columns all have the same number of rows (taking into account rowspans).

For example, with this setup, we can represent a table

Row [Cell colspan=1 rowspan=2 A, Cell colspan=1, rowspan=1 B]
Row [Cell colspan=2 rowspan=1 C]

That doesn't really make sense. Maybe we can't do much better, though, without dependent types. At least we could switch to

   | Table Attr Caption [(Alignment, Maybe Double)] [Row]
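Short of dependent types, the missing guarantees could at least be checked at run time. The sketch below is one way to do that under assumptions: `Cell` carries a plain label instead of `[Block]`, and `placeRow` and `validTable` are hypothetical names. It walks the rows while tracking which columns are still covered by a rowspan from above, and rejects any row whose cells do not fit the column count exactly.

```haskell
-- Rowspan, colspan, and a label standing in for the cell contents (assumption).
data Cell = Cell Int Int String deriving (Show, Eq)

-- Place one row, given per-column counts of rows still covered from above.
-- Returns the updated counts, or Nothing if the cells do not fit exactly.
placeRow :: [Int] -> [Cell] -> Maybe [Int]
placeRow [] [] = Just []
placeRow [] _  = Nothing                       -- cells left over
placeRow (p : ps) cells
  | p > 0 = fmap ((p - 1) :) (placeRow ps cells)
placeRow _ [] = Nothing                        -- a free column but no cell for it
placeRow ps (Cell r c _ : cells)
  | r < 1 || c < 1                    = Nothing
  | length free < c || any (> 0) free = Nothing  -- runs off the edge or overlaps
  | otherwise = fmap (replicate c (r - 1) ++) (placeRow rest cells)
  where (free, rest) = splitAt c ps

-- A table is accepted if every row fits and no rowspan runs past the last row.
validTable :: Int -> [[Cell]] -> Bool
validTable ncols = go (replicate ncols 0)
  where
    go pending []       = all (== 0) pending
    go pending (r : rs) = maybe False (`go` rs) (placeRow pending r)
```

With this check, the two-row example above (A spanning two rows next to B, then C spanning two columns) is rejected for a two-column table, because C has only one free column left in the second row.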

mb21 commented

I guess it's kind of the same issue as with typed attributes: are we willing to use GHC-specific features (like dependent types or view patterns) to make the code more type safe (while keeping it generic), or do we favour a simpler and more accessible code base?

Personally, I'm fine with using more advanced GHC features if it helps with future maintenance...

jgm commented

Slight improvement

type Rowspan = Int
type Colspan = Int
data Cell = Cell Rowspan Colspan [Block]
data Row = Row [Cell] | Header [[Cell]]
type Caption = [Block]
-- constructor for Table in Block:
   | Table Attr Caption [(Alignment, Maybe Double)] [Row]

It may be worthwhile to cross check whatever syntax you decide on against docutils grid table implementation. They document their data structure pretty well and have supported this multi-cell spanning functionality for a while without much grief: https://sourceforge.net/p/docutils/code/HEAD/tree/trunk/docutils/docutils/parsers/rst/tableparser.py#l91

jgm commented

For convenience I copy the docutils comment here:

 Here's an example of a grid table::

        +------------------------+------------+----------+----------+
        | Header row, column 1   | Header 2   | Header 3 | Header 4 |
        +========================+============+==========+==========+
        | body row 1, column 1   | column 2   | column 3 | column 4 |
        +------------------------+------------+----------+----------+
        | body row 2             | Cells may span columns.          |
        +------------------------+------------+---------------------+
        | body row 3             | Cells may  | - Table cells       |
        +------------------------+ span rows. | - contain           |
        | body row 4             |            | - body elements.    |
        +------------------------+------------+---------------------+

    Intersections use '+', row separators use '-' (except for one optional
    head/body row separator, which uses '='), and column separators use '|'.

    Passing the above table to the `parse()` method will result in the
    following data structure::

        ([24, 12, 10, 10],
         [[(0, 0, 1, ['Header row, column 1']),
           (0, 0, 1, ['Header 2']),
           (0, 0, 1, ['Header 3']),
           (0, 0, 1, ['Header 4'])]],
         [[(0, 0, 3, ['body row 1, column 1']),
           (0, 0, 3, ['column 2']),
           (0, 0, 3, ['column 3']),
           (0, 0, 3, ['column 4'])],
          [(0, 0, 5, ['body row 2']),
           (0, 2, 5, ['Cells may span columns.']),
           None,
           None],
          [(0, 0, 7, ['body row 3']),
           (1, 0, 7, ['Cells may', 'span rows.', '']),
           (1, 1, 7, ['- Table cells', '- contain', '- body elements.']),
           None],
          [(0, 0, 9, ['body row 4']), None, None, None]])

    The first item is a list containing column widths (colspecs). The second
    item is a list of head rows, and the third is a list of body rows. Each
    row contains a list of cells. Each cell is either None (for a cell unused
    because of another cell's span), or a tuple. A cell tuple contains four
    items: the number of extra rows used by the cell in a vertical span
    (morerows); the number of extra columns used by the cell in a horizontal
    span (morecols); the line offset of the first line of the cell contents;
    and the cell contents, a list of lines of text.
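For comparison with the types proposed in this thread: docutils counts extra rows and columns, so a conversion to rowspan/colspan would add one to each. The sketch below shows that mapping under assumptions: `DocutilsCell` is a Haskell rendering of the Python tuples above, `Cell` keeps lines of text instead of `[Block]`, and `fromDocutilsRow` is a hypothetical name.

```haskell
-- Nothing for a slot covered by another cell's span, otherwise
-- (morerows, morecols, line offset, lines of text), as in the docutils comment.
type DocutilsCell = Maybe (Int, Int, Int, [String])

-- Rowspan, colspan, and the cell's lines (stand-in for [Block]; assumption).
data Cell = Cell Int Int [String] deriving (Show, Eq)

-- Drop the covered slots and turn "extra" counts into total spans.
fromDocutilsRow :: [DocutilsCell] -> [Cell]
fromDocutilsRow row =
  [Cell (morerows + 1) (morecols + 1) ls | Just (morerows, morecols, _, ls) <- row]
```

Applied to "body row 2" above, the two `None` slots disappear and the second cell ends up with a colspan of 3.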

jgm commented

Helpful comment on the commonmark forum about the need for row headers to support accessibility (screen readers). So maybe we need to think harder about how to add that.

Some more information (and examples) about accessible tables: https://www.w3.org/WAI/tutorials/tables/

Is there any place I can see the specification for this planned feature (so I can start implementing it and sending patches), or is the design still WIP?

jgm commented

We haven't yet settled on a specification. I'd like to get it right before we do the coding, because it will be a pain to revise it in the future if we don't. But, help on this is most definitely welcome (both in the design and in the coding phase). If you have a concrete suggestion for the data types, after reading the above discussion, feel free to make it here. When we get to the coding phase I would love to have help.

I would like to do this for pandoc 2.0, just trying to resolve some other issues first.

If you are on the planning stage, I'll use that opportunity and post some suggestions for the potential enhancement to the markdown syntax you support.

Here it is: #3782

Pandoc is very well integrated into the R statistical environment, and is the format of choice when doing automatically generated scientific reports. I believe multi-cell tables would be welcome by many (if not most) data scientists on this planet. :-)

ickc commented

Here it is: #3782

Why don't you just add it to this issue? I guess #3782 should be closed because it is a duplicate of this.

mb21 commented

Just documenting multimarkdown's colspan syntax:

|             |          Grouping           ||
First Header  | Second Header | Third Header |
 ------------ | :-----------: | -----------: |
Content       |          *Long Cell*        ||
Content       |   **Cell**    |         Cell |

To indicate that a cell should span multiple columns, then simply add additional pipes (|) at the end of the cell, as shown in the example. If the cell in question is at the end of the row, then of course that means that pipes are not optional at the end of that row…. The number of pipes equals the number of columns the cell should span.
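The pipe-counting rule can be sketched as a small row parser. This is a rough illustration, not MultiMarkdown's or pandoc's actual parser: `parseRow`, `splitOnPipe` and `trim` are hypothetical helpers, escaped pipes and pipes inside inline code are ignored, and the separator row is not handled.

```haskell
import Data.Char (isSpace)

-- Split a line at every '|' character.
splitOnPipe :: String -> [String]
splitOnPipe s = case break (== '|') s of
  (seg, [])       -> [seg]
  (seg, _ : rest) -> seg : splitOnPipe rest

-- Remove leading and trailing whitespace.
trim :: String -> String
trim = f . f
  where f = reverse . dropWhile isSpace

-- Parse one table row into (colspan, cell text) pairs. A cell followed by
-- n directly adjacent pipes spans n columns.
parseRow :: String -> [(Int, String)]
parseRow line = collect segments
  where
    raw = splitOnPipe (trim line)
    -- A leading pipe yields an empty first segment; drop it.
    noLead = case raw of
      ("" : rest) -> rest
      _           -> raw
    -- A row ending in a pipe yields an empty last segment; drop that too.
    segments
      | not (null noLead) && null (last noLead) = init noLead
      | otherwise                               = noLead
    collect [] = []
    collect (seg : rest) =
      let (extra, rest') = span null rest   -- empty segments = extra pipes
      in (1 + length extra, trim seg) : collect rest'
```

A cell that is merely blank (spaces between two pipes) still counts as its own cell here; only pipes with nothing at all between them extend the preceding cell.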

mb21 commented

About the Haskell structure, it may be interesting to see how GraphViz does it...

Looking at the definition of Table compared to the other Block types, it has a load of parameters, a few of them optional (caption and header) and a few with default values (alignment and column widths).
I had a quick look at the HTML and LaTeX writers; neither seems to actually pattern match on these values (which @jgm seems to have as a concern in #684 regarding how a general Attributes type should be defined).
The pattern matching I found (by browsing quickly through the two writers) was on nested elements, for example inside Block elements.

What I am trying to say here is that at least the optional Table parameters could easily be placed inside some sort of Attribute instead. I would even go as far as to say that the two parameters with default values could be as well.
That would not complicate the code in my opinion.

Handling the caption would be something like

-- Current: the caption is a positional field of Table.
blockToHtml opts (Table capt aligns widths headers rows') = do
  captionDoc <- if null capt
                   then return mempty
                   else do
[...]
-- Suggested: the caption is looked up in an attribute list instead.
blockToHtml opts (Table attr rows') = do
  captionDoc <- if isNothing (lookup "tableCaption" attr)
                   then return mempty
                   else do
[...]

I know this is a wee bit simplified, as the caption is a list of Inline, and the lookup on the Attr class that @mb21 proposed on 26 Feb returns a String. However, that could be solved by storing the inlines as the HTML string (or LaTeX string, etc.) and then perhaps changing the lookup function to return a Maybe instead.

Regarding the non-optional elements: the writer seems to assume that they have the correct lengths? So I guess it could just as well assume that they are available in the attributes. But of course, if you want to use dependent types at some point to assert the correct length of the list, then I assume they would have to be kept as they are now rather than stored inside the attributes.

One issue, however, is that one would need to know which attributes are "special" and as such should not be written to the final output. For example, in the HTML writer the "tableCaption" attribute should be removed from the list before the table tag is generated with the remaining attributes added.

jgm commented

See #2978; it would be useful to have a way to specify a "short caption."

jgm commented

I don't have time to do this in the near future, so I'm removing it from the pandoc 2.0 milestone so as not to hold up the release.

That's a real shame about missing 2.0.

What sort of version bump would such a change to the AST require: is 2.0 → 2.1 acceptable, or would it have to wait for a 3.0?

ickc commented

I think it's a good strategy:

  • Too much change in 1 release put too much burden not only to the pandoc developers but also to the developers of the "pandoc ecosystem", e.g. the wrappers/interfaces
  • supporting table column/row span is (probably) not backward incompatible in syntax (i.e. old documents, cli scripts, created in pandoc 2.0 should still be valid)
  • wait for pandoc-types 2.0?

I'm willing to help (I have haskell experience, though not lua), but circumstances do not allow it for another 6 weeks.

@jgm I note in your proposal above that you are adding Attr to Table

-- constructor for Table in Block:
   | Table Attr Caption [(Alignment, Maybe Double)] [Row]

Do you want to resolve this at the same time as #684 (Attr for all block elements?)?

ickc commented

For #684, I don't think it has been settled yet. As with many issues here, I think the bottleneck is not implementing it but discussing and deciding exactly which syntax, features, etc. are needed. The last time @jgm spoke about #684, he was still not convinced that all elements should receive attributes. The mentality seems to be that if we can get away with having no attributes on some elements, then we shouldn't grant them.

And it is similar for this issue too. Not only is the AST not settled (is it?), but neither is the syntax for using this feature (say, in pandoc markdown).

And just to note the level of complexity of this issue: all reader/writer pairs have to support it (even if for some formats it has to be rendered as HTML), so it is quite non-trivial. I'd imagine it would be easier with a couple of people working on it, but then someone needs to coordinate: for example, settling the AST first, then making a list of formats and deciding that for these formats HTML will be used, and for these others (markdown, rst, etc.) grid tables will be used. And I imagine the LaTeX pair will be rather difficult, because the from-format can have a lot of different package variants, and we may need to be careful about the suitability of longtable in the to-format (not to mention that there are other issues with rendering tables using only longtable).

Lastly, I guess @jgm wants to release pandoc 2.0 ASAP. Six weeks might be too long for that (guessing from comments on the polyglot HTML writer).

jgm commented

If helpful, this is the expected compatibility matrix (assuming || is adopted)?

| Type             | Row Spans? | Col Spans? |
|------------------|------------|------------|
| Simple Tables    | No         | No         |
| Multiline Tables | No         | Yes (?)    |
| Grid Tables      | Yes        | Yes        |
| Pipe Tables      | No         | Yes (?)    |

I don't know if this is appropriate for this thread, but would it make sense to promote grid tables as enabled by default, as well as the default output for the markdown writer? It would make the table to table conversion from say TeX or html a little more convenient if you can assume that an input row and colspan table will always convert to an output row and colspan capable format.

jgm commented

Another point of reference for this discussion: grid tables are a pain to write in most cases and also create diffs that are pretty hard to review. If you are looking for text formats that allow row and column spans, the Linux kernel team created a reStructuredText extension for what they call a "flat table", which allows column and row spans as well as multiple header rows. Here's the spec: https://return42.github.io/linuxdoc/linuxdoc-howto/table-markup.html#rest-flat-table

If you guys adopted something similar to a flat table type to cover these advanced table markup requests, there may be some potential wins with respect to both ease of development and ease of use for pandoc markdown users.

ickc commented

The Linux Kernel team created a restructured text extension to create what they call a "flat table" which allows for column and row spans as well as multiple header rows.

One suggestion is that after pandoc adds the column/row span feature to its AST (pandoc 2.1?), the rst reader could add support for this extension. Then people who want to use this feature "in markdown" could use the new pandoc 2.0 extension raw_attribute to inline an rst table in markdown. The downside is that markdown syntax won't be allowed within the table. But from what I observe, the pandoc community holds a high standard for which markdown extensions get added, according to the "markdown test". From what I read at your link, that table format is not markdown-ish at all, so most probably it will not be considered as a markdown table extension in pandoc.

I feel your pain in writing tables in markdown too. Currently, the grid table is the only one (out of 4) in pandoc that can use every table feature in the AST. That's why I wrote pantable, to write my markdown tables in CSV format instead (which also covers all of pandoc's table features in the AST, and has a reader/writer pair to go back and forth between native tables and CSV tables). By the way, it would be a challenge to support the col/row span feature in this CSV format, but it's my intention to do that, because I can't imagine writing tables in any of the native extensions.

Nice! A csv format that supports all the features of the AST would be perfect! I've always imagined trying to make that work but haven't had the time to try and get something working. I'm watching pantable now 👍

mb21 commented

Any plan to support the vertical header mentioned in #1359

Helpful comment on the commonmark forum about the need for row headers to support accessibility (screen readers)

HTML solved this with the th (table header) element, which is used in place of td (cells), not in place of tr (rows). See e.g. this example. Yet of course, there is also the thead element which works more like the proposed ADT.

DocBook has the rowheader attribute for this case.

If pandoc would use a HeaderCell as well, the HTML Writer could simply wrap the first n rows that only contain HeaderCells in a thead. It could also wrap the last n rows in the table with only HeaderCells in a tfoot.


We should try to ensure that there's only one way to represent a given table in the data type, and that bad tables are impossible to represent. See #3648.

As mentioned, I guess there are only two ways:

  • dependent types (@jgm, have you ruled this out categorically already?)
  • make sure everything goes through the builder (which would do some runtime corrections if necessary, like filling up missing columns)
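
To make the builder option concrete, here is a minimal sketch on a deliberately simplified table of plain-string cells. `SimpleTable`, `mkTable` and `tableRows` are hypothetical names for illustration, not anything from pandoc-types, and the only "runtime correction" shown is padding short rows with empty cells up to the widest row:

```haskell
-- Hypothetical simplified types, for illustration only.
type Cell = String
type Row = [Cell]

-- In a real module the constructor would stay unexported, so that
-- every table has to go through 'mkTable'.
newtype SimpleTable = SimpleTable [Row]
  deriving (Eq, Show)

-- | Smart constructor: pad every row with empty cells so that all
-- rows end up as wide as the widest row.
mkTable :: [Row] -> SimpleTable
mkTable rows = SimpleTable (map pad rows)
  where
    width = maximum (0 : map length rows)
    pad r = r ++ replicate (width - length r) ""

-- | Read-only access to the (now rectangular) rows.
tableRows :: SimpleTable -> [Row]
tableRows (SimpleTable rs) = rs
```

The invariant then holds by convention (hidden constructor) rather than by type, which is the trade-off against the dependent-types option.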

I'm still pondering adding Attr to Cell. If only to give filters etc. an escape-hatch. And for complex use-cases that require the headers attribute in HTML, which "contains a list of space-separated strings, each corresponding to the id attribute of the <th> elements that apply to this element".


Incorporating all of the above, would lead to the following AST:

type Rowspan = Int
type Colspan = Int
type Caption = [Block]
type ShortCaption = [Inline]
type Colwidth = Maybe Double
data Cell = Cell       Attr Rowspan Colspan [Block]
          | HeaderCell Attr Rowspan Colspan [Block]
          | NoCell -- if this slot is taken by another cell's row/colspan (idea from docutils snippet above)
type Row = [Cell]
-- constructor for Table in Block:
   | Table Attr Caption ShortCaption [(Alignment, Colwidth)] [Row]

(That is, unless we get the general possibly-floating container discussed in #3177, which could wrap the table. Then we could place the Caption and ShortCaption in that.)

Edit: An alternative to Cell ... | HeaderCell ... would be Cell CellType ...

Maybe we could get away with using LiquidHaskell to verify the table invariants. (I have close to no experience with LiquidHaskell, so I don't know whether it's a good choice for this use-case.)

mb21 commented

@tarleb LiquidHaskell might indeed work for this, whereas I'm not sure dependent types would (although I've no experience either), because we don't know at compile-time how long the table rows are going to be – we only know that all rows should have the same length.

Perhaps a structure that guarantees that the tables make sense can be composed without dependent types with something like this:

data Orientation = Vertical | Horizontal
data Split a = Split Orientation Rational a a
             -- ^ orientation, ratio, values
data Grid = GridSplit (Split Grid)
          | GridCell String

Edit: I've slightly messed it up at first, but the idea is the same.
Edit 2: Though there would be more than one way to represent a table, and it won't work for all kinds of tables.

mb21 commented

Just for the record, a simple dependent-types-based demo, without row- and col-spans though:

{-# LANGUAGE DataKinds, GADTs, StandaloneDeriving #-}

-- List with typed length
-- from https://www.schoolofhaskell.com/user/konn/prove-your-haskell-for-great-safety/dependent-types-in-haskell
-- for production, we could use https://hackage.haskell.org/package/sized instead
data Nat = Z | S Nat -- type-level natural numbers
data List n a where
  Nil  :: List Z a
  (:-) :: a -> List n a -> List (S n) a
infixr 5 :-
deriving instance Eq a => Eq (List n a)
deriving instance Show a => Show (List n a)

-- `Table n` and `Row n` have `n` columns
data Alignment = AlignLeft
               | AlignRight
               | AlignDefault deriving (Eq, Show)
type Colwidth = Maybe Double
data Cell = Cell String
  deriving (Show, Eq)
type Row n = List n Cell
data Table n = Table (List n Alignment) (List n Colwidth) [Row n]
  deriving (Show, Eq)

-- Sample
r1 = Cell "foo" :- Nil
r2 = Cell "bar" :- Nil
r3 = Cell "foobar" :- Cell "bar" :- Nil
t = Table (AlignDefault :- Nil) (Just 1.0 :- Nil) [r1, r2]
-- the following do not type-check:
-- Table (AlignDefault :- AlignDefault :- Nil) (Just 1.0 :- Nil) [r1, r2]
-- Table (                AlignDefault :- Nil) (Just 1.0 :- Nil) [r1, r3]
danse commented

there are a lot of valid contributions in this thread, and this is a complex and multi-faceted problem. i suggest to add a document to the repo about the design of the feature, and develop the discussion through pull requests to that document. advantages:

  • we will have a reference about the current status of the design at any time
  • the document can stay also afterwards as documentation about the code
  • discussion about the design can be threaded by specific changes to the design document

i hope you will agree that this is desirable, i find it difficult to keep all the comments here in mind in order to get an idea about where we want to go, it makes context switching more demanding. if we all agree about the method, it will be just a matter of who has availability to summarise this thread in a design document the soonest. i might work on this in the near future as we are facing the same problem

ickc commented

add a document to the repo

The wiki on GitHub can be used for this.

The wiki on GitHub can be used for this.

How would non-collaborators contribute? This article outlined a nice automated workflow. It seems pretty legit, the API token is encrypted, etc.

danse commented

i was thinking of something more straightforward, like a new .md file in doc or in a design folder. in the past i experimented with "documentation driven development", that is designing a feature by writing the documentation for the user. in our case, a user is also the developer of a writer/reader who will be interested in adding support for spans, and is thus interested in the structure of the data model and the rationale behind that

ickc commented

How would non-collaborators contribute?

Oh, I don't know if the pandoc wiki is setup to allow anyone to edit. I think it is but can anyone confirm?

I think it is but can anyone confirm?

Oh my...it is...that seems kinda dangerous. I (very briefly) added a second exclamation point at the top of the home-page and the edit was accepted. FWIW I don't think that should be editable by just anybody...

I personally think that a github-markdown document would be ideal, it could allow for task-lists etc. Either way, I think a maintainer / collaborator needs to create the Wiki or markdown document first before it can be edited by others. Not true, "outsiders" could create it.

Edit: One downside to the wiki is it's more difficult to understand who created what, and what happens with multiple editors working at once. That quick test did this

commit c030c0401d91087e6ef059b96bc7316b0449476c
Author: Stephen McDowell <svenevs@users.noreply.github.com>
Date:   Thu Feb 15 15:44:12 2018 -0800

    Updated Home (markdown)

commit 2170707e2ea0d41bf06ef3e0773d1cd4fa968153
Author: Stephen McDowell <svenevs@users.noreply.github.com>
Date:   Thu Feb 15 15:43:59 2018 -0800

    Updated Home (markdown)

GitHub makes it a little too easy to do that. Sorry for cluttering that, I was expecting a new screen like when you edit a file online for a PR, but it seems it just made the commit right away 😱

Benefit of Wiki: the maintainers of pandoc don't need to approve every PR that goes toward this specific discussion. That in its own right might be justification enough.

ickc commented

I think at least it requires a github account? And since it is a git the commits can be reversed.

It is dangerous, because pandoc.org points to the pandoc extras page of the wiki, and anyone can edit or even delete that page.

But I guess it has been working for pandoc for years and nobody has done any damage so far. So I guess it should only be changed when the hypothetical bad actor appears? (Note that someone did do something to the gitit demo wiki that forced @jgm to take it down. That's why we don't have a demo for gitit anymore.)

Yeah you do need a GitHub account, and there is a history attached so it can definitely be revived if the bad actor appears.

So what are your thoughts about this initial mock-up skeleton? I don't really understand too much about the (magic) behind pandoc, but if you think it's a good start then once the page is added anybody watching this issue can jump in

danse commented

thanks @svenevs! i copy your idea of collecting references and with this post i propose a summary of what happened above, hoping that it will help.

summary of the above

i think contributions can be grouped in three categories:

  • markdown syntax
  • plan of action
  • data model proposals, comparisons, ideas, constraints

personally, given the title of this issue, i think that discussions on markdown syntax should go elsewhere, it's a way more specific problem. the plan of action gives us a way to evolve the data model but there isn't agreement about what that data model will look like. overthinking is a good idea in this case given the cost of changing the design later.

so basically the "design document" i was thinking about consists of a few lines of haskell code that have been pasted here in different versions, and what's the agreement about it?

this is it, as far as i understand the point now is just to evolve this data model definition. i hope that this post can save time to others going through this long issue

danse commented

i've been thinking about data models that can give us geometrical
consistency without the need for dependent types. we don't want data
instances featuring cells that overlap, or areas of a table that are
not covered by any cell, or rows or columns that overflow outside
table boundaries.

i had the intuition that such constraints could be enforced by a
recursively defined table structure, but i found out that a table like
the following cannot be represented that way:

a b b    letters are repeated to represent
a c d    cells spanning multiple rows or columns
e e d

i keep this comment for the next person that will have a similar idea

I've stumbled upon the same issue with that "gridsplit" approach. Not sure if it can be avoided, but here's another sketch with dependent types (which doesn't account for rowspans and colspans being 0, but counts them otherwise):

data Cell : (colspan, rowspan: Nat) -> Type where
  MkCell : (cs, rs: Nat) -> String -> Cell cs rs

data RowSpan : Type where
  RS : (colspan, rowspan : Nat) -> RowSpan

data Row : (width: Nat) -> (rowspans: List RowSpan) -> Type where
  AddCell : Cell cs rs -> Row n rss -> Row (n + cs) (RS cs rs :: rss)
  EmptyRow : Row 0 []

rowSpanWidth : RowSpan -> Nat
rowSpanWidth (RS cs _) = cs

rowSpanIter : RowSpan -> Maybe RowSpan
rowSpanIter (RS cs Z) = Nothing
rowSpanIter (RS cs (S x)) = Just (RS cs x)

rowSpansWidth : List RowSpan -> Nat
rowSpansWidth = sum . map (\(RS cs _) => cs)

rowSpansIter : List RowSpan -> List RowSpan
rowSpansIter = mapMaybe rowSpanIter

data Table : (width, height: Nat) -> (rowspans: List RowSpan) -> Type where
  AddRow : Row rw rss
         -> Table w h prs
         -> {auto prf: (rw + rowSpansWidth (rowSpansIter prs) = w)}
         -> Table w (S h) (rowSpansIter $ rss ++ prs)
  EmptyTable : (rw: Nat) -> Table rw 0 []

data ValidTable : (width, height: Nat) -> Type where
  MkTable : Table w h rs -> {auto prf: rowSpansIter rs = []} -> ValidTable w h

trickyTable : ValidTable 3 3
trickyTable = MkTable $ 
            AddRow (AddCell (MkCell 2 1 "e") EmptyRow) $
            AddRow (AddCell (MkCell 1 1 "c") (AddCell (MkCell 1 2 "d") EmptyRow)) $
            AddRow (AddCell (MkCell 1 2 "a") (AddCell (MkCell 2 1 "b") EmptyRow)) $
            EmptyTable 3

Edit: though might be easier to add cells without any proofs first, and wrap everything in the end, counting all the widths and heights there.
Edit 2 (2018-06-19): this doesn't account for exact placements of row-spanning cells, still allowing invalid tables; a check for that should be added too.

danse commented

I'm working with tables now and i'm wondering whether Table could be a record, i think that it would make my life simpler. Any drawbacks? I don't see this in either of the proposals i collected above.

I'm not good at reading dependent types but also the last proposal from @defanor doesn't seem to use a record-like type.

I might be wrong, but using a record will also make it easier for us in the future to add properties without the need for extensive refactorings.
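
For what it's worth, a rough sketch of what a record-style Table could look like. The field names are made up for illustration, and `Block`, `Alignment` and `Attr` are small stand-ins for the pandoc-types definitions so the snippet compiles on its own:

```haskell
-- Stand-ins for the pandoc-types definitions (assumptions, simplified).
data Block = Para String deriving (Eq, Show)
data Alignment = AlignLeft | AlignRight | AlignCenter | AlignDefault
  deriving (Eq, Show)
type Attr = (String, [String], [(String, String)])

data CellType = DataCell | HeaderCell deriving (Eq, Show)

-- Hypothetical field names throughout.
data Cell = Cell
  { cellType     :: CellType
  , cellRowspan  :: Int
  , cellColspan  :: Int
  , cellContents :: [Block]
  } deriving (Eq, Show)

data Table = Table
  { tableAttr     :: Attr
  , tableCaption  :: [Block]
  , tableColSpecs :: [(Alignment, Maybe Double)]
  , tableRows     :: [[Cell]]
  } deriving (Eq, Show)

emptyTable :: Table
emptyTable = Table ("", [], []) [] [] []

-- | Record update touches one field and leaves the rest alone.
withCaption :: [Block] -> Table -> Table
withCaption cap t = t { tableCaption = cap }
```

Code that only uses accessors and record update (like `withCaption`) would keep compiling if a field such as a short caption were added later; code that pattern-matches positionally on the constructor would not.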

danse commented

a user in pandoc-discuss mentioned the value of allowing multiple headers ... not sure whether it's already been mentioned above, maybe we want to collect the use cases for the new data model somewhere else in order to make the design easier

ickc commented

Multiple headers has been discussed above.

By the way, some people might need a footer as well.

bpj commented

Am I understanding correctly that there are two distinct problems here?

  1. A model to represent extended table features in the AST qua data structure.

  2. How to implement such a model in Haskell.

mb21 commented

@bpj, well, the AST data structure is implemented in Haskell, so it's kind of one problem.

It's possible to implement a correct-by-construction grid using diagonalization, with cells placed on top.

I've begun an implementation at https://gist.github.com/dbaynard/9736e1e7c78da94f13da3ea6ed45f96f and would be grateful for feedback (and contributions).

Briefly, it assumes a table is a grid (represented in diagonal form) with cells (of any size or shape) stored at the first point the diagonal traversal encounters them. The implementation uses a GADT to ensure that only correct tables can be constructed.

I haven't got to the bit where I add the cells, but the algorithm should be quite straightforward to apply.

This by itself imposes no constraints on cells, e.g. header cells in specific places. But it seems that may be desirable.

(I used ghc 8.4.3, no dependencies other than base)


A diagonal traversal of an array (1 → 20) looks as follows:

1  3  6 10 14
2  5  9 13 17
4  8 12 16 19
7 11 15 18 20

The GADT (see the full gist for the rest of the documentation and definition). When I talk about growing the table, I mean while descending the syntax tree. As constructors, these do the opposite, but I found it more helpful to think about them tearing down the data structure.

data T (n :: Nat) (extend :: Extending) a where
  -- | Grow the table height and width by 1, by cons-ing a new diagonal List
  -- of length one greater than the previous
  (:+:) :: List n a -> T (n + 1) extend a -> T n 'Diagonal a
  -- | Grow the table width by 1 at fixed height by cons-ing a new diagonal List
  -- to the right of the previous list
  (:-:) :: Or '[ 'Filling, 'Width] extend => List n a -> T n extend a -> T n 'Width a
  -- | Grow the table height by 1 at fixed width by cons-ing a new diagonal List
  -- below the previous list
  (:|:) :: Or '[ 'Filling, 'Height] extend => List n a -> T n extend a -> T n 'Height a
  -- | Fill the remaining table space by cons-ing a new diagonal list below
  -- and to the right of the previous list
  (:::) :: List n a -> T (n - 1) 'Filling a -> T n 'Filling a
  End :: T 0 'Filling a
jgm commented

@dbaynard - A representation as a list of rows would map much more easily onto the formats we are targeting. I guess I'd like to understand better what the advantage of the diagonal representation would be.

A representation as a list of rows would map much more easily onto the formats we are targeting.

Yes, I can see it would.

I guess I'd like to understand better what the advantage of the diagonal representation would be.

I would too; I need to investigate further whether it is useful. It may be able to guarantee that only valid tables are representable in the AST, yet all tables have the same type (meaning no need for dependent types/liquidhaskell/etc.). Also representations would be unique.

Perhaps even just proposing it may help us to find a solution that meets the criteria in #1024 (comment) and subsequent comments, even if it isn't this one.

mb21 commented

@dbaynard and I were pondering this at ZuriHac. Here's our thinking so far.

Basic requirements

  • col/row-spans
  • headers:
    • multiple rows in a header
    • secondary headers
    • row headers (e.g. have the first column be headers of the rest of the rows)
  • footer?
  • column-widths, alignments (as existing)

Concerning the headers, I feel fairly confident that CellType = DataCell | HeaderCell (analogous to the HTML <td> and <th>) is the best solution. It even allows us to do a simple table footer: The HTML Writer could simply wrap the first n rows containing only HeaderCells in a <thead> and the last n rows containing only HeaderCells in a <tfoot>.
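
A minimal sketch of that head/body/foot split, on a stripped-down cell type (no spans, plain-string contents). `splitSections` and `isHeaderRow` are hypothetical helper names, and treating an all-header table as "all head, no foot" is an assumption of this sketch:

```haskell
data CellType = DataCell | HeaderCell deriving (Eq, Show)

-- Simplified stand-in for the proposed Cell: type and text only.
data Cell = Cell CellType String deriving (Eq, Show)
type Row = [Cell]

-- | A non-empty row consisting only of header cells.
isHeaderRow :: Row -> Bool
isHeaderRow cells = not (null cells) && all isHeader cells
  where
    isHeader (Cell t _) = t == HeaderCell

-- | Split rows into (head, body, foot): the leading run of header-only
-- rows, the trailing run of header-only rows, and everything between.
splitSections :: [Row] -> ([Row], [Row], [Row])
splitSections rows = (hd, reverse bodyRev, reverse footRev)
  where
    (hd, rest) = span isHeaderRow rows
    (footRev, bodyRev) = span isHeaderRow (reverse rest)
```

A writer could then emit the three parts as <thead>, <tbody> and <tfoot>. Rowspans that cross a section boundary are not handled here.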

What tables should be representable in the AST?

Let's have only rectangular cells (no overlapping cells).

In the spirit of not letting the perfect stand in the way of the good, we were thinking that a nested list of cells (like we currently have) is the best and most pragmatic solution. It should also make implementing the writers easier.

David had the insight that there wouldn't have to be any possibility of invalid tables in the AST, if there was a well-defined way to interpret any AST. Similar to how the HTML spec in some cases tells browsers how to insert certain missing (implicit) elements.

Pandoc could still emit warnings on missing cells. Additionally, a writer can choose to pad out missing rows (with cells of rowspan=1, colspan=1) or not.

Interesting HTML cases

<table>
  <tr>
    <td>1</td>
  </tr>
  <tr>
    <td>1</td>
    <td>2</td>
  </tr>
</table>

HTML validator warning:

A table row was 2 columns wide and exceeded the column count established by the first row (1).


<table>
  <tr>
    <td>1</td>
    <td>2</td>
  </tr>
  <tr>
    <td>1</td>
    <td rowspan="2">2</td>
  </tr>
</table>

HTML validator error:

Table cell spans past the end of its row group established by a tbody element; clipped to the end of the row group.


<table>
  <tr>
    <td colspan="2">2</td>
  </tr>
</table>

HTML validator error:

Table column 2 established by element td has no cells beginning in it.


<table>
  <tr><td>1</td></tr>
  <tr></tr>
  <tr><td>1</td></tr>
</table>

HTML validator error:

Row [...] has no cells beginning on it.


<table>
<tr>
  <td>1</td>
  <td rowspan="2">2</td>
</tr>
<tr>
  <td colspan="2">3</td>
</tr>
</table>

HTML validator error:

Table cell is overlapped by table cell.

Some of those in a jsfiddle.

pandoc's options

So what should a pandoc writer do when it encounters the equivalent of the cases above as a pandoc AST? What should the writer do with overlapping or missing cells (in the implicit grid)?

  • error out
  • interpret the table in a different way than in HTML, like push the overlapping cell so far right and to the bottom until it fits, possibly making the table larger.
  • pass the problem along, potentially outputting invalid HTML (or worse, invalid LaTeX which would make PDF output fail)
  • drop or crop cells in a deterministic way and emit a warning:
    • crop cells that would overlap with already filled space
    • crop (or drop) cells that would outgrow the column count established by the first row in the table (this is stricter than HTML, where this case is only a warning and the table grows to the right)
    • crop cells that would outgrow the row count established by the column with the smallest row count.
    • drop rows without any cells

At this point, I'm favouring the last option. This seems consistent with the feedback we got at ZuriHac, where people were like: "don't overthink it, do whatever HTML does, make sure the reader doesn't produce an invalid table and do whatever is easiest in the writer if someone produces an invalid table with a filter."
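
To get a feeling for how much work the "crop or drop" option is, here is a rough sketch of such a clean-up pass. Everything in it is an assumption made for illustration: the simplified `Cell` (rowspan, colspan, label), the `Placed` result type and the name `normalize`. It covers three of the rules above (crop on overlap, drop cells beyond the column count of the first row, crop rowspans at the last row), does not drop empty rows, and emits no warnings:

```haskell
import Data.List (foldl')

-- Hypothetical simplified cell: rowspan, colspan, label.
data Cell = Cell Int Int String deriving (Eq, Show)

-- A cell after layout: top-left slot plus its (possibly cropped) spans.
data Placed = Placed
  { pRow :: Int, pCol :: Int, pRowspan :: Int, pColspan :: Int, pLabel :: String }
  deriving (Eq, Show)

-- | Lay cells out on the implicit grid, cropping or dropping as needed.
normalize :: [[Cell]] -> [Placed]
normalize [] = []
normalize rows@(firstRow : _) = reverse placed
  where
    -- the column count is established by the first row
    width = sum [max 1 cs | Cell _ cs _ <- firstRow]
    height = length rows
    (_, placed) = foldl' placeRow ([], []) (zip [0 ..] rows)

    -- state: (occupied slots, placed cells in reverse order)
    placeRow acc (r, cells) = snd (foldl' (placeCell r) (0, acc) cells)

    placeCell r (c0, (occ, out)) (Cell rs cs lbl)
      | c >= width = (c, (occ, out)) -- no free slot left in this row: drop
      | otherwise = (c + cs', (newSlots ++ occ, Placed r c rs' cs' lbl : out))
      where
        -- skip slots already taken by rowspans from earlier rows
        c = until (\x -> x >= width || (r, x) `notElem` occ) (+ 1) c0
        -- crop the colspan at the table edge or the first occupied slot
        free = takeWhile (\x -> (r, x) `notElem` occ) [c .. width - 1]
        cs' = min (max 1 cs) (length free)
        -- crop the rowspan at the last row
        rs' = min (max 1 rs) (height - r)
        newSlots = [(r + dr, c + dc) | dr <- [0 .. rs' - 1], dc <- [0 .. cs' - 1]]
```

On the overlapping HTML case above (a rowspan=2 cell followed by a colspan=2 cell in the next row), this crops cell 3 to a single column instead of letting it overlap cell 2. The occupied-slot list makes it quadratic, which is fine for a sketch but probably not for a real writer.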

ADT

Coming back to the AST, which might then look like this:

type Rowspan = Int
type Colspan = Int
type Caption = [Block]
type ShortCaption = [Inline]
type Colwidth = Maybe Double
data CellType = DataCell | HeaderCell
data Cell = Cell CellType Rowspan Colspan [Block]

Table Attr Caption ShortCaption [(Alignment, Colwidth)] [[Cell]]

There are still some things open:

  1. Instead of Caption and ShortCaption, the caption could also be handled by wrapping the table in a Figure with a caption (#3177)
  2. CellType could also include NoCell (this cell is occupied by a span of another cell)
  3. Potentially adding Attr to Cell. If only to give filters an escape-hatch. And for complex use-cases that usually boil down to wanting to specify which <th> cell is a heading for which <td> cells. Or alternatively, polymorphic Cell a (I'll have to reread Trees That Grow). This might be useful to instantiate differently when writing markdown to save width, heights of a cell.

Concerning 2. and 3. we should probably implement at least the HTML and markdown writers to get a feeling for how the AST format would impact implementation. We might put a function that validates/cleans-up a table in Writers.Shared.

bpj commented

Great writeup @mb21 (and discussion), thank you!

I agree that we can use a straightforward list of lists representation, without dependent types, and decide how to handle the edge cases. We created a short list of tables that this representation should not handle (e.g. 3 dimensional tables, cell splits; more below). Tables Concepts • Tables • WAI Web Accessibility Tutorials was quite helpful.

Or alternatively, polymorphic Cell a (I'll have to reread Trees That Grow). This might be useful to instantiate differently when writing markdown to save width, heights of a cell.

The principle here is: we can reduce code duplication by having the same data structure for tables in the AST and in writers, but writers need different information (e.g. the dimensions of the output structures in characters/pixels). The advantage of Trees that grow is that there is no runtime cost.
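
A small sketch of the simplest version of that idea: one cell type with an annotation slot, instantiated differently for the AST and for a writer. This is a plain type parameter, not the full Trees That Grow encoding with type families, and `Size`, `AstCell`, `SizedCell` and `measure` are names invented for the example:

```haskell
{-# LANGUAGE DeriveFunctor #-}

-- Hypothetical simplified cell: annotation, rowspan, colspan, text.
data Cell a = Cell a Int Int String
  deriving (Eq, Show, Functor)

-- What a reader would produce: no extra information.
type AstCell = Cell ()

-- What a plain-text writer might want: output dimensions in characters.
data Size = Size { widthChars :: Int, heightLines :: Int }
  deriving (Eq, Show)

type SizedCell = Cell Size

-- | Annotate a cell with the size of its text: the longest line and
-- the number of lines.
measure :: AstCell -> SizedCell
measure (Cell () rs cs txt) = Cell (Size w h) rs cs txt
  where
    ls = lines txt
    w = maximum (0 : map length ls)
    h = length ls
```

Functions that don't care about the annotation can be written once for `Cell a`, and `fmap (const ())` gets back from the writer's view to the AST view.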


Cell splits
welly commented

I've not read this thread entirely but I'm currently using pandoc (2.7.2) with wkhtmltopdf and am finding that the same issue is occurring. I wondered if anyone can explain why this won't work when using wkhtmltopdf to generate the html -> pdf conversion?

Thanks very much!

ickc commented

@welly, question like this might fit pandoc-discuss better. The issue tracker is for discussing the feature request. Currently the pandoc AST doesn't have a model for this so it isn't supported in pandoc yet.

Really cry for this feature.
Or any other tools to convert pandoc output of gfm to this format:

|  | 2 | 3 |
| --- | --- | --- |
| a | @cols=2: |
| b |  | test<br>ricky |
| c | @rows=2: | <br>Yes |
| e | No |

I went looking for multi-span rows and columns in pandoc, and ended up here. My interest is not with the science community, but in the writing of government standards. We are currently experimenting with writing the standard texts in markdown and rst, and use pandoc to convert them to PDF (via docbook). But the lack of rowspan support in the tables makes it impossible to represent the layout of the original docx. It would thus be great if pandoc supported spans. Sorry for not having any code to contribute, but I thought it would be useful to know about yet another use case.

I found this after a search about formatting / laying out tables also.
We're looking at CMS options and how to generate the following table (as an example) would be useful as we require some of our tables to be pivoted 90 degrees for simple stuff:

<table>
  <caption>Dates and amounts</caption>
  <thead>
    <tr>
      <th scope="col">Date</th>
      <th scope="col">Amount</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th scope="row">First 6 weeks</th>
      <td>£109.80 per week</td>
    </tr>
    <tr>
      <th scope="row">Next 33 weeks</th>
      <td>£109.80 per week</td>
    </tr>
    <tr>
      <th scope="row">Total estimated pay</th>
      <td>£4,282.20</td>
    </tr>
  </tbody>
</table>

The markup is based on the UK Gov's Design System: https://design-system.service.gov.uk/components/table/ - but we have a use case for these types of tables.

A strong use case for this are some 3GPP specs that contain bit fields, e.g. https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=3111
Would love to be able to view them as markdown or org in my Emacs, but all of the bitfields are off ;(

jgm commented

For those who have been following this issue, we have a PR for new table types here:
jgm/pandoc-types#66
I want to avoid excessive bikeshedding on this issue, but if you have final comments, now is the time. The type allows for column and row spans, short captions, attributes, multiple header rows, footers, intermediate headers, and overriding alignments at the cell level.