jgm/pandoc-types

Improving tables

Closed this issue · 20 comments

See the main todo list and the relevant issue. I would like to start implementing better table handling in Pandoc. Specifically, I would implement all but the last of these bullet points using one of the designs below (or a modified version of one of them).

I think something like this recently outlined approach is a good way forward for now. The representation is a little loose (any table in the intermediate representation is valid, so there are multiple ways to write a given table, but only one normalized way), but it should allow the readers and writers to be switched more easily. This is a slightly modified version of that approach:

type RowSpan = Int
type ColSpan = Int
type Caption = [Block]
type ShortCaption = [Inline]
type ColWidth = Maybe Double
data CellType = DataCell | HeaderCell
data Cell = Cell Attr CellType (Maybe Alignment) RowSpan ColSpan [Block]
type Row = [Cell]

data Block =
  ...
  | Table Attr Caption ShortCaption [(Alignment, ColWidth)] [Row]
  ...

The Maybe Alignment on the individual cells allows the cells to override the alignment of the column(s) in which they reside. This makes it easier to specify one's intentions when a cell spans multiple columns with conflicting alignments, and has the advantage of allowing better \multicolumn and \multirow support in the LaTeX reader and writer. It also comes up naturally when one thinks of possible extensions to the supported markdown table formats.
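As a rough illustration of the override, here is a minimal sketch of how a writer might resolve a cell's alignment. The `Alignment` type is a stand-in mirroring pandoc's constructors so the snippet is self-contained, and `effectiveAlignment` is a hypothetical helper, not part of any proposed API. Which column supplies the fallback for a multi-column cell is left to the caller here; the thread does not settle that.

```haskell
import Data.Maybe (fromMaybe)

-- Stand-in for pandoc's Alignment type, so the sketch compiles on its own.
data Alignment = AlignLeft | AlignRight | AlignCenter | AlignDefault
  deriving (Eq, Show)

-- Hypothetical helper: a cell-level override wins; otherwise the cell
-- falls back to the column alignment chosen by the caller.
effectiveAlignment :: Alignment -> Maybe Alignment -> Alignment
effectiveAlignment columnAlign cellAlign = fromMaybe columnAlign cellAlign
```

A LaTeX writer could, for example, compare the result against the column alignment and emit a \multicolumn wrapper only when they differ.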

A similar design has the following modifications:

data Cell = Cell Attr (Maybe Alignment) RowSpan ColSpan [Block]
data HeaderRow = Row Attr [Cell]
data BodyRow = Row Attr [Cell] [Cell]

data Block =
  ...
  | Table Attr Caption ShortCaption [(Alignment, ColWidth)] [HeaderRow] [BodyRow] [HeaderRow]
  ...

This has the advantage of making explicit the table head/body/foot and row head/body structure that seems to be assumed in the first approach, where the leading rows that consist entirely of header cells become the table head, and the trailing such rows become the table foot. Cells in the head and foot sections would correspond to th cells, and cells in the body section would correspond to td cells. It does not require a CellType, but one could still be added, making these even more similar to HTML tables. The disadvantage of this approach is that it makes the table representation more complex.
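To make the implicit structure of the first approach concrete, here is a small sketch of how a writer might recover head, body, and foot from a flat row list. The `Cell` type is a cut-down stand-in (cell type plus a text payload), and `isHeaderRow` and `splitSections` are hypothetical names. When every row is a header row, this sketch assigns them all to the head; that tie-break is an assumption.

```haskell
data CellType = DataCell | HeaderCell
  deriving (Eq, Show)

-- Simplified stand-in for the first design's Cell.
data Cell = Cell CellType String
  deriving (Eq, Show)

type Row = [Cell]

-- A row counts as a header row when it is non-empty and all cells are headers.
isHeaderRow :: Row -> Bool
isHeaderRow row = not (null row) && all isHeader row
  where
    isHeader (Cell t _) = t == HeaderCell

-- Leading header rows become the head; trailing header rows of the
-- remainder become the foot; everything in between is the body.
splitSections :: [Row] -> ([Row], [Row], [Row])
splitSections rows = (hd, body, foot)
  where
    (hd, rest)         = span isHeaderRow rows
    (footRev, bodyRev) = span isHeaderRow (reverse rest)
    body               = reverse bodyRev
    foot               = reverse footRev
```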


I assume that the tables are normalized (laid on a grid with a given width so that overlapping cells and empty spaces can be dealt with in the table) like so, informally:

  1. Empty rows are filtered out from the table.
  2. The grid has a height equal to the number of rows in the table, and some fixed width.
  3. Rows are laid on the grid from top to bottom.
  4. The top of each cell is as far down on the grid as it is on the table.
  5. The top-left corner of each cell, in turn, is placed on the leftmost empty grid space on the row, if it exists within the grid width, and is otherwise dropped. If it would overlap a cell on a previous row or extend past the remaining grid width, its width (ColSpan) would be lowered to fit. If it would extend past the bottom of the grid, its height (RowSpan) would be lowered to fit.
  6. If there are too few cells in a row to fill the available width, then blank cells are added to the end of the row.
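The steps above can be sketched roughly as follows. This is only one reading of the informal description, with a cut-down `Cell` (spans plus a text payload) and hypothetical names `normalize`, `layRow`, and `blankCell`. It also clamps non-positive spans to 1, which the list above does not mention.

```haskell
type RowSpan = Int
type ColSpan = Int

-- Simplified stand-in: only the spans and a text payload are kept.
data Cell = Cell RowSpan ColSpan String
  deriving (Eq, Show)

type Row = [Cell]

blankCell :: Cell
blankCell = Cell 1 1 ""

-- Normalize rows against a grid of the given width.
normalize :: Int -> [Row] -> [Row]
normalize width rows = go (length rows') (replicate width 0) rows'
  where
    rows' = filter (not . null) rows                 -- step 1
    go :: Int -> [Int] -> [Row] -> [Row]
    go _ _ [] = []
    go remaining overhang (r : rs) =
      let (r', overhang') = layRow remaining overhang r
      in r' : go (remaining - 1) (map (\n -> max 0 (n - 1)) overhang') rs

-- Lay one row on the grid. The Int list holds, per column, how many rows
-- (counting this one) are still covered by a cell from an earlier row.
-- 'remaining' is the number of grid rows left, counting this one.
layRow :: Int -> [Int] -> Row -> (Row, [Int])
layRow _ [] _ = ([], [])              -- past the grid width: drop leftover cells
layRow remaining (o : os) cells
  | o > 0 =                           -- column taken by an earlier row: skip it
      let (row, os') = layRow remaining os cells
      in (row, o : os')
  | otherwise =
      let free = 1 + length (takeWhile (== 0) os)    -- free run from here
          Cell rs cs body = case cells of
            []      -> blankCell                     -- step 6: pad with blanks
            (c : _) -> c
          cs' = max 1 (min cs free)                  -- step 5: shrink ColSpan
          rs' = max 1 (min rs remaining)             -- step 5: shrink RowSpan
          (row, os') = layRow remaining (drop (cs' - 1) os) (drop 1 cells)
      in (Cell rs' cs' body : row, replicate cs' rs' ++ os')
```

For instance, with a width of 3, the rows `[Cell 2 1 "a", Cell 1 5 "b"]` and `[Cell 1 1 "c"]` come out as `[Cell 2 1 "a", Cell 1 2 "b"]` and `[Cell 1 1 "c", Cell 1 1 ""]`: the over-wide cell is narrowed, and the second row skips the column still covered by "a" before being padded.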

The table head, table foot, row head (the list of row head sections without the row body), and row body (the list of row body sections without the row head) should be normalized independently in any design where these exist (implicitly in the first, or explicitly in the second). The overall table width would be the length of the [(Alignment, ColWidth)] list, and the row head and row body widths would add up to that width. (The row head width would be the width of the first row in the row head.)

jgm commented

I'm delighted that you're interested in taking this on. It's one of the top priority improvements for pandoc, but it has been hard to get it done because (a) it's a big change and (b) it's hard to decide what the best type is.

I don't think we should let the perfect be the enemy of the good: we should discuss (b), but we should set a limit to how long we discuss it before just moving ahead with something that will be better than what we have currently. (If needed, we can make further incremental changes in the future.)

More later...

jgm commented

The first approach allows any cell to be a header cell. That might be an advantage for representing tables where the left column is the header (not common) -- such tables can't be represented in the second approach -- but it has the disadvantage that many table formats can't represent arbitrary header cells. (HTML is an exception obviously.) So I'm leaning more to the second approach. I don't know how important it is to represent tables where the header is a column rather than a row, and I'm not sure what the cost would be of unrepresentable tables on the first approach.

The layout I have in my mind, incidentally, is this:

+---------------------+
|     Table Head      |
+----------+----------+
| Row Head | Row Body |
+----------+----------+
|     Table Foot      |
+---------------------+

since I realized that the second representation might suggest that the row headers are not under the table head. This is also the implicit layout of the first approach.

When you say that "the left column is the header", do you mean that the table is transposed during writing so that the table head rows become columns? Otherwise I think that the row head section could be used as the header. The only oddity would be that a single header line would be split up among multiple rows.

In the first approach, I suppose that after separating out as many sections as the writer supports (table head, foot, row head) the writer would forget about the cell type and simply write the cell content as-is.

jgm commented

I wasn't understanding what you mean by Row Head. Now I see you mean a header cell in the left position in a row. And now I notice that you have Row Attr [Cell] [Cell] -- the first group of cells is the row header, the rest the body. OK, that makes sense. More in a bit.

jgm commented

I'm wondering whether it would make things easier if the types were a bit more uniform. Rendering a header row will often be almost the same as rendering a body row. What if we just had

data Row = HeaderRow Attr [Cell] {- row heads -} [Cell] {- other cells -}
data Block =
  ...
  | Table Attr Caption ShortCaption [(Alignment, ColWidth)] [Row] {- header -} [Row] {- body -} [Row] {- footer -}

The drawback is that this allows you to represent distinctions that are irrelevant in the header and footer rows. The advantage is that it makes it easier to deal with rows in a uniform way in the code. I'm not really sure about this tradeoff.

If we do go with your original approach, we'll need a different type constructor:

data HeaderRow = HeaderRow Attr [Cell]

Another approach might be:

data Row a = Row a Attr [Cell]
data HeaderRow
data FooterRow
data BodyRow = BodyRow [Cell]
...
  | Table Attr Caption ShortCaption [(Alignment, ColWidth)]
         [Row HeaderRow] [Row BodyRow] [Row FooterRow]

Writers that can't represent row headers might find it easier to concatenate the row head and body and operate on an [(Attr, [Cell])], or even a [[Cell]], list.
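A minimal sketch of that flattening, assuming a row shaped like Row Attr [Cell] [Cell]. The `Attr` and `Cell` definitions are simplified stand-ins, and `flattenRow` and `plainRows` are hypothetical helper names.

```haskell
-- Stand-ins so the sketch is self-contained; pandoc's Attr uses Text.
type Attr = (String, [String], [(String, String)])

newtype Cell = Cell String
  deriving (Eq, Show)

-- Row attributes, row head cells, row body cells.
data Row = Row Attr [Cell] [Cell]
  deriving (Eq, Show)

-- Forget the head/body split but keep the row attributes.
flattenRow :: Row -> (Attr, [Cell])
flattenRow (Row attr rowHead rowBody) = (attr, rowHead ++ rowBody)

-- Forget the attributes as well.
plainRows :: [Row] -> [[Cell]]
plainRows = map (snd . flattenRow)
```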

If there were a uniform row type, then the table picture could be

+-------------------+-------------------+
| TH above row head | TH above row body |
+-------------------+-------------------+
|     Row Head      |     Row Body      |
+-------------------+-------------------+
| TF below row head | TF below row body |
+-------------------+-------------------+

I am not sure if this is a useful distinction, but it does give the row header in the table head some meaning.

jgm commented

If I'm understanding you correctly, you are now suggesting a uniform type

data Row = Row Attr [Cell] [Cell]

to be used for the header, body, and footer? That sounds good to me. It's conceivable that some formats could treat "TH above row head" specially.

That was my interpretation of the first Row type in your previous comment. I think it's reasonable; at worst it's another distinction in the intermediate representation for writers to ignore to a greater or lesser degree, and the uniformity could be of some benefit, though it's hard to say without actually starting to implement the change.

jgm commented

OK, to summarize then:

type RowSpan = Int
type ColSpan = Int
type Caption = [Block]
type ShortCaption = [Inline]
type ColWidth = Maybe Double
data Cell = Cell Attr (Maybe Alignment) RowSpan ColSpan [Block]
type RowHead = [Cell]
type RowBody = [Cell]
data Row = Row Attr RowHead RowBody
type TableHead = [Row]
type TableBody = [Row]
type TableFoot = [Row]
data Block =
  ...
  | Table Attr Caption ShortCaption [(Alignment, ColWidth)] TableHead TableBody TableFoot

@tarleb - what do you think of this?

Looks good to me!

Maybe we could group the arguments to Table more, e.g. by introducing a type data BareTable = BareTable TableHead TableBody TableFoot or something similar?

jgm commented

If we want to compress things, I'd prefer something like

data Caption = Caption (Maybe [Inline]) [Block] -- short caption, full caption
type ColSpec = (Alignment, Maybe Double)
data Block =
  ...
  | Table Attr Caption [ColSpec] TableHead TableBody TableFoot

And should we consider using newtypes instead of type for things like TableBody and ColSpec?
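For reference, here is a sketch of how the compressed design might look as one self-contained module, with a tiny example value. `Attr`, `Inline`, and the `Plain` constructor are simplified stand-ins for the real pandoc-types definitions, and `nullAttr`, `textCell`, `exampleTable`, and `tableWidth` are illustrative names only.

```haskell
-- Simplified stand-ins for existing pandoc-types definitions.
type Attr = (String, [String], [(String, String)])
data Alignment = AlignLeft | AlignRight | AlignCenter | AlignDefault
  deriving (Eq, Show)
newtype Inline = Str String
  deriving (Eq, Show)

type RowSpan = Int
type ColSpan = Int
data Caption = Caption (Maybe [Inline]) [Block]  -- short caption, full caption
  deriving (Eq, Show)
type ColSpec = (Alignment, Maybe Double)
data Cell = Cell Attr (Maybe Alignment) RowSpan ColSpan [Block]
  deriving (Eq, Show)
data Row = Row Attr [Cell] [Cell]                -- row head, row body
  deriving (Eq, Show)
type TableHead = [Row]
type TableBody = [Row]
type TableFoot = [Row]

data Block
  = Plain [Inline]
  | Table Attr Caption [ColSpec] TableHead TableBody TableFoot
  deriving (Eq, Show)

nullAttr :: Attr
nullAttr = ("", [], [])

-- A 1x1 cell holding plain text, with no alignment override.
textCell :: String -> Cell
textCell s = Cell nullAttr Nothing 1 1 [Plain [Str s]]

-- Two columns: one row-head column and one row-body column.
exampleTable :: Block
exampleTable =
  Table nullAttr (Caption Nothing [Plain [Str "Fruit"]])
    [(AlignLeft, Nothing), (AlignRight, Nothing)]
    [Row nullAttr [textCell "Name"] [textCell "Price"]]
    [Row nullAttr [textCell "Apple"] [textCell "1.20"]]
    []

-- The overall width is the length of the column spec list.
tableWidth :: Block -> Int
tableWidth (Table _ _ specs _ _ _) = length specs
tableWidth _ = 0
```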

Having the caption components bundled together would be good. That bundling might happen anyway with a new Figure block.

I'm not sure how great the benefit of newtyping would be. The ColSpec could be data ColSpec = ColSpec ... I suppose.

jgm commented

Advantage of a newtype is that the types then enforce the distinctions. With type synonyms you won't get an error if you put a TableHead where a TableBody should go, etc.
Disadvantage is that it's a bit more cumbersome doing pattern matching, etc. However, it's not too hard, and there's always coerce.
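A small sketch of that tradeoff, using a placeholder `Row` type; the wrappers and helper names here are illustrative, not a proposed API. `coerce` comes from `Data.Coerce` in base.

```haskell
import Data.Coerce (coerce)

-- Placeholder row type for the sketch.
newtype Row = Row [String]
  deriving (Eq, Show)

newtype TableHead = TableHead [Row]
  deriving (Eq, Show)
newtype TableBody = TableBody [Row]
  deriving (Eq, Show)

-- Pattern matching has to unwrap the newtype first.
bodyRowCount :: TableBody -> Int
bodyRowCount (TableBody rows) = length rows

-- coerce strips the wrapper at no runtime cost.
headRows :: TableHead -> [Row]
headRows = coerce

-- An explicit, deliberate conversion between the two wrappers.
headAsBody :: TableHead -> TableBody
headAsBody = coerce

-- bodyRowCount (TableHead []) would be rejected by the type checker,
-- whereas with plain type synonyms it would compile silently.
```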

jgm commented

That said, we use type aliases all over the place in pandoc-types now (e.g. Attr), so maybe this isn't the time to change.

Perhaps it isn't the time to change type/newtype approaches.

It sounds like the most recent summary, with the modified data Caption, is acceptable. If that is the case, I can start working from that design.

jgm commented

Sounds good to me!

Hello,

I just wanted to share that I'm in the process of submitting a Google Summer of Code 2020 proposal to provide a library with similar functionality, as it seems to be something many Haskell packages could benefit from, not least of all pandoc. The exact API is not finalized, but the proposal is in rough draft form at the moment. I do hope this is something that can benefit this project and many others.

jgm commented

@Mercerenies - the proposal looks quite interesting. How were you thinking it intersects with pandoc? Do you have any suggestions about the proposal above, or does it seem reasonable to you?

The timing ended up being quite inconvenient, as I reached out to @tarleb about the proposal a few days before this issue was opened. That being said, I do still feel like a dedicated library for this kind of thing would be very nice to have, for several reasons, even if pandoc has its own type as well.

In terms of the above proposal, I share the concern about type vs newtype but understand why making that change would be an inconsistency with the rest of the library. Aside from that, what you said above seems pretty reasonable. I'd personally go for the version that doesn't involve a 7-arg constructor.

jgm commented

I completely agree that a dedicated library could be useful even if pandoc has its own type -- and there could be glue code converting between pandoc tables and this library's type.