jgm/commonmark-hs

`commonmark-pandoc`: calculate relative cell widths for pipe tables

max-heller opened this issue · 11 comments

Pandoc's Markdown parser sets relative widths for each column in a pipe table when the table contains long rows, which allows LaTeX to wrap cells to avoid overflowing the page (implemented in jgm/pandoc@eb8aee4).

commonmark-pandoc doesn't implement this functionality and always sets the column width to default, resulting in poor PDF rendering:

colspecs = map (\al -> (toPandocAlignment al, ColWidthDefault))
aligns

It'd be great if this feature could be implemented in commonmark-pandoc, but I don't have the familiarity with Haskell or the codebase to do it myself.

jgm commented

commonmark-pandoc can only do what the core commonmark-extensions API for tables supports, so that is where the change would need to happen. `commonmark-pandoc' is just a thin wrapper.

commonmark-pandoc can only do what the core commonmark-extensions API for tables supports, so that is where the change would need to happen. `commonmark-pandoc' is just a thin wrapper.

Not sure I follow: the code is currently passing ColWidthDefault, so it seems like it could also pass a specific ColWidth width.

jgm commented

Well, how would it figure out what ColWidth to use? commonmark-pandoc doesn't have access to the widths of separator lines.

Another issue is that computing relative widths requires a standard for 100%. In pandoc this is adjustable using --columns. Not sure how this would be handled in commonmark.

Well, how would it figure out what ColWidth to use? commonmark-pandoc doesn't have access to the widths of separator lines.

Ah, I understand now: commonmark-extensions's HasPipeTable typeclass doesn't pass through the separator widths, so that'd have to be changed first.

Another issue is that computing relative widths requires a standard for 100%. In pandoc this is adjustable using --columns. Not sure how this would be handled in commonmark.

Perhaps commonmark-pandoc could always compute relative widths (using the separator line approach) and then leave it to the pandoc writer to determine whether to use them or determine widths automatically (if pandoc thinks the rows fit within --columns)?

jgm commented

Perhaps commonmark-pandoc could always compute relative widths (using the separator line approach) and then leave it to the pandoc writer to determine whether to use them or determine widths automatically (if pandoc thinks the rows fit within --columns)?

This would require pandoc to be parsing the document rather than handing off parsing to commonmark. Pandoc doesn't know what the line widths are -- it just sends commonmark the whole text and gets back an AST.

This would require pandoc to be parsing the document rather than handing off parsing to commonmark. Pandoc doesn't know what the line widths are -- it just sends commonmark the whole text and gets back an AST.

I was imagining Pandoc would parse with commonmark, get an AST, then estimate rendered line widths based on the AST for each row.

If I understand the current approach, it seems like the column width of the markdown source is used to determine whether a line is long:

| column a | column b |
| -------- | -------- |
| normal   | more normal |
| something | reaaaaaaaaaaaaaaaaaaaaaally long | # considered long

With commonmark doing the parsing, Pandoc wouldn't be able to tell the length of the source line, but it'd get an AST that looks something like:

headers: ["column a", "column b"]
separators: ["--------", "--------"] or [8, 8] # separator widths added to the `commonmark-extensions` API
rows: [
    ["normal", "more normal"],
    ["something", "reaaaaaaaaaaaaaaaaaaaaaally long"],
]

With the AST, Pandoc would then sum up an approximation of width for each cell in a row, and determine whether row exceeds --columns.

Approximating the rendered width -- based on either the source markdown or the AST -- seems tricky in the presence of short commands that expand to long output and long commands that expand to short output (e.g. \textendash), but if a rough approximation is okay, it seems doable to perform on the AST layer as well.

jgm commented

You could get a rough approximation that way, maybe.
But there are lots of cases where it would fail. E.g. a Link in the AST could have been a short reference link or a long inline link in the source.

You could get a rough approximation that way, maybe. But there are lots of cases where it would fail. E.g. a Link in the AST could have been a short reference link or a long inline link in the source.

Is that not already an issue with the source-based approach? [link](reaaaaaaaaaaaaaaaaaaaally loooooong link) looks long in the source but is short when rendered

jgm commented

What I mean is that it could not match the behavior specified in the manual, which refers to the length of the source line -- which we could only guess about.

What I mean is that it could not match the behavior specified in the manual, which refers to the length of the source line -- which we could only guess about.

I see. As a clunky but correct approach then, how about passing the source (or the source length) for each row through commonmark-extensions along with the parsed rows?

jgm commented

We should be able to change

class HasPipeTable il bl where
  pipeTable :: [ColAlignment] -> [il] -> [[il]] -> bl

to something like

class HasPipeTable il bl where
  pipeTable :: [(ColAlignment, ColWidth)] -> [il] -> [[il]] -> bl

in commonmark-extensions. We can have the extension calculate widths in the same way as pandoc. Then it would be trivial to have commonmark-pandoc use this information.

The one potential drawback is that our pipe tables would sometimes render differently from GFM's -- and this might be a problem for some users.

To fix that, we could make pipeTableSpec parameterizable.