pandoc/lua-filters

pagebreak filter doesn't work with Commonmark

dmurdoch opened this issue · 7 comments

The pagebreak.lua filter depends on the raw_tex extension on the markdown reader, but that extension is not supported by commonmark or commonmark_x. This results in \pagebreak or \newpage being written to the output file with the backslash escaped, so the macro is visible instead of being translated into a page break.

Example: working in the lua-filters/pagebreak directory, this command

  pandoc --from commonmark --to pdf sample.md -o sample.pdf --lua-filter pagebreak.lua

produces this output:

[Screenshot: Screen Shot 2022-11-29 at 12 26 52 PM]

The solution is to look for the macros in the Para() function of the filter. A complication is that commonmark+sourcepos splits the macros into two parts and wraps them in Span elements, so the Para() function needs to handle that case too.

You can make this work in CommonMark with

```{=latex}
\pagebreak
```

This requires the raw_attribute extension, which is enabled by default in commonmark_x.

Sure, but my thinking went as follows:

In favour of the change:

  • there are a lot of existing documents using the simpler syntax, and they'll all be broken if Pandoc transitions to CommonMark without this change. It was one of the first issues I saw when I tried to use the sourcepos extension in R Markdown documents.
  • Markdown is supposed to be readable, and a bare \pagebreak is more readable than the fenced solution.

Against the change:

  • It doesn't fit the CommonMark design very well, which is the reason the raw_tex extension is incompatible with the commonmark reader. The spec says "Backslashes before other characters are treated as literal backslashes".

But CommonMark doesn't provide a way to enter a page break, so it needs to be some kind of extension, and this seems like a fairly harmless one. People who really want paragraphs containing nothing but \newpage or \pagebreak should just avoid using the filter.

I think my preferred solution here would be to create a new filter that converts the special paragraphs into LaTeX, e.g.,

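-- is_pagebreak is a helper, not defined here, that recognizes
-- paragraphs consisting solely of \newpage or \pagebreak.
function Para (p)
  if is_pagebreak(p) then
    return pandoc.RawBlock('latex', pandoc.utils.stringify(p))
  end
end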

Users would run the filter before pagebreak.lua.
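
For illustration, here is one way such a pre-filter might be fleshed out. This is only a sketch under assumptions: the thread leaves is_pagebreak undefined, so its body, the plain_text helper, and the PAGEBREAK_COMMANDS table are hypothetical. It also assumes that commonmark+sourcepos yields only Str elements, possibly wrapped in Span, for these paragraphs, as described earlier in the thread.

```lua
-- Hypothetical set of commands to treat as page breaks; both are the
-- macros named in the issue report.
local PAGEBREAK_COMMANDS = { ['\\newpage'] = true, ['\\pagebreak'] = true }

-- Concatenate the text of Str inlines, descending into Span wrappers
-- (the case reported for commonmark+sourcepos). Returns nil as soon as
-- any other inline type is encountered.
local function plain_text (inlines)
  local parts = {}
  for _, el in ipairs(inlines) do
    if el.t == 'Str' then
      parts[#parts + 1] = el.text
    elseif el.t == 'Span' then
      local inner = plain_text(el.content)
      if inner == nil then return nil end
      parts[#parts + 1] = inner
    else
      return nil
    end
  end
  return table.concat(parts)
end

-- True if the paragraph contains nothing but one page-break command.
local function is_pagebreak (para)
  local text = plain_text(para.content)
  return text ~= nil and PAGEBREAK_COMMANDS[text] == true
end

-- Replace matching paragraphs with raw LaTeX, which pagebreak.lua can
-- then translate for the target format.
function Para (p)
  if is_pagebreak(p) then
    return pandoc.RawBlock('latex', pandoc.utils.stringify(p))
  end
end
```

Saved under a name such as commonmark-pagebreak.lua (hypothetical), it would be listed first on the command line, for example `--lua-filter commonmark-pagebreak.lua --lua-filter pagebreak.lua`, since pandoc applies filters in the order given.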

There are two reasons for that:

  1. It's cleaner.
  2. Making the filter act on Para elements has a significant performance impact; most users should not have to pay that.

I'd be more open to adding support for special divs, so commonmark_x users could write

::: pagebreak
:::

or

{.pagebreak}
---

For plain CommonMark, an HTML-based syntax could be acceptable:

<hr class="pagebreak"/>
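
A rough sketch of how the div and HTML variants might be recognized in a filter follows. Nothing here comes from the existing pagebreak.lua: the helper names is_pagebreak_div and is_pagebreak_hr are hypothetical, and the sketch assumes the reader delivers the `<hr>` line as an HTML RawBlock. The `{.pagebreak}` form on a horizontal rule is left out because it is unclear from the thread how that attribute would appear in the AST.

```lua
-- True if the Div is empty and carries the `pagebreak` class.
local function is_pagebreak_div (div)
  if #div.content > 0 then return false end
  for _, class in ipairs(div.classes) do
    if class == 'pagebreak' then return true end
  end
  return false
end

-- True if the raw HTML is a lone <hr> element with class "pagebreak".
local function is_pagebreak_hr (raw)
  return raw.format == 'html'
    and raw.text:match('^%s*<hr%s+class="pagebreak"%s*/?>%s*$') ~= nil
end

-- Emit a raw LaTeX \newpage so that pagebreak.lua, run afterwards,
-- can translate it for the target format.
function Div (div)
  if is_pagebreak_div(div) then
    return pandoc.RawBlock('latex', '\\newpage')
  end
end

function RawBlock (raw)
  if is_pagebreak_hr(raw) then
    return pandoc.RawBlock('latex', '\\newpage')
  end
end
```

Matching on Div and RawBlock rather than Para keeps the check away from ordinary paragraphs, which is the performance concern raised above.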

The existing filter already works on Para elements; it looks for a single FF character there. The proposed change makes the test more complicated and so it will be slower, but is it really enough of a difference to be noticeable? (In the context where I'm using it I think the answer is almost certainly no: I run knitr, then Pandoc, then pdflatex. The Pandoc step is almost always very quick compared to the others.)

You're right. I forgot about that. I'm still hesitant to add this kind of special case here.

Regarding your proposed syntax choices: I think the one using ::: is the most readable, so it's the one I'd choose if new syntax is needed. But the backward compatibility of \pagebreak and its familiarity to people who know LaTeX are still points in its favour.

I've moved the code for the pagebreak filter to pandoc-ext/pagebreak. The code has been updated to be more configurable; it would now be easier to implement the suggested changes without the mentioned drawbacks. PRs welcome.

Closing this here.