jgm/pandoc

Heading surrounded by `<details>`/`</details>` HTML blocks produces invalid EPUB

max-heller opened this issue · 6 comments

Explain the problem.

Converting the following Markdown to EPUB produces invalid output:

---
title: 'Title'
...

<details>

## Heading

text

</details>

Assuming test.md contains the above, I'm running:

pandoc test.md -o test.epub

epubcheck returns the following errors on the converted EPUB:

$ epubcheck test.epub
Validating using EPUB version 3.3 rules.
ERROR(RSC-005): test.epub/EPUB/text/ch001.xhtml(16,38): Error while parsing file: element "section" not allowed yet; missing required element "summary"
FATAL(RSC-016): test.epub/EPUB/text/ch001.xhtml(19,3): Fatal Error while parsing file: The element type "section" must be terminated by the matching end-tag "</section>".

Check finished with errors
Messages: 1 fatal / 1 error / 0 warnings / 0 infos

EPUBCheck completed

Apple Books displays the following error but renders the content correctly (unless any content follows the </details>):

This page contains the following errors:
error on line 18 at column 11: Opening and ending tag mismatch: section line 16 and details
Below is a rendering of the page up to the first error.

The input parses, correctly, as:

[ RawBlock (Format "html") "<details>"
, Header 2 ( "heading" , [] , [] ) [ Str "Heading" ]
, Para [ Str "text" ]
, RawBlock (Format "html") "</details>"
]

The example without the <details>/</details>, or without the ## Heading, produces valid EPUB.

I'd expect the output to match the rendered HTML, which interprets the input as a combined <details></details> tag, or to ignore the HTML blocks, as the PDF writer seems to do.

Pandoc version?
pandoc 3.1.12.3, macOS 14.3.1

jgm commented

For EPUB we need to split the document into sections, and the makeSections function just sees these raw HTML blocks as black boxes; that's why you get the invalid nesting. There's no good general solution to this problem. It might help pandoc if you put the whole thing in a div:

::: details

<details>

## Heading

text

</details>

:::

since the Div structure is something pandoc looks at in makeSections. (Untested.)
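
To make the "black box" point concrete, here is a minimal Python sketch of the failure. It is a toy model, not pandoc's actual makeSections: the tuple-based block types and the helper names (`render_with_sections`, `is_well_formed`) are assumptions made up for illustration. Each header opens a `<section>` that runs until the next header of the same or higher level, and raw HTML is copied through without being inspected.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy block model (not pandoc's real types):
#   ("raw", html), ("header", level, text), ("para", text)


def render_with_sections(blocks):
    """Wrap each header and the blocks after it in <section>; raw HTML is opaque."""
    out = []
    open_levels = []  # heading levels of the sections currently open
    for block in blocks:
        if block[0] == "header":
            _, level, text = block
            # A header closes any open section of the same or deeper level.
            while open_levels and open_levels[-1] >= level:
                out.append("</section>")
                open_levels.pop()
            out.append(f"<section><h{level}>{text}</h{level}>")
            open_levels.append(level)
        elif block[0] == "para":
            out.append(f"<p>{block[1]}</p>")
        else:
            # Raw HTML passes through untouched, even if it opens or closes a tag.
            out.append(block[1])
    # Sections still open at the end of the document are closed last.
    out.extend("</section>" for _ in open_levels)
    return "\n".join(out)


def is_well_formed(fragment):
    """Return True if the fragment parses as XML inside a wrapper element."""
    try:
        ET.fromstring(f"<body>{fragment}</body>")
        return True
    except ET.ParseError:
        return False


blocks = [
    ("raw", "<details>"),
    ("header", 2, "Heading"),
    ("para", "text"),
    ("raw", "</details>"),
]
html = render_with_sections(blocks)
print(html)
print(is_well_formed(html))  # False
```

In this model the section opens after `<details>` and closes after `</details>`, so the two elements overlap instead of nesting. Dropping either the raw blocks or the header yields well-formed output, which matches the behaviour described in the report.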

> It might help pandoc if you put the whole thing in a div:

This arrangement works:

<details>

::: details

## Heading

text

:::

</details>
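
For documents with many such blocks, the same rearrangement could be applied automatically. The following is a rough Python sketch that works on the block list of pandoc's JSON AST, for example inside a JSON filter run with `--filter`. The function names `raw_html` and `wrap_details_contents` are made up for illustration, and the sketch assumes the raw blocks are exactly `<details>` and `</details>` (no attributes, no `<summary>` on the same line).

```python
def raw_html(block):
    """Return the stripped text of an HTML RawBlock, or None for other blocks."""
    if block.get("t") == "RawBlock" and block["c"][0] == "html":
        return block["c"][1].strip()
    return None


def wrap_details_contents(blocks):
    """Wrap the blocks between raw <details> and </details> in a Div.details."""
    out = []
    i = 0
    while i < len(blocks):
        if raw_html(blocks[i]) == "<details>":
            # Scan forward for the matching close, allowing nested <details>.
            depth = 1
            j = i + 1
            while j < len(blocks):
                text = raw_html(blocks[j])
                if text == "<details>":
                    depth += 1
                elif text == "</details>":
                    depth -= 1
                    if depth == 0:
                        break
                j += 1
            if j < len(blocks):  # a matching close was found
                inner = wrap_details_contents(blocks[i + 1:j])
                div = {"t": "Div", "c": [["", ["details"], []], inner]}
                out.extend([blocks[i], div, blocks[j]])
                i = j + 1
                continue
        # No match (or not a <details> opener): keep the block unchanged.
        out.append(blocks[i])
        i += 1
    return out


# The block list from the report, written in pandoc's JSON encoding.
doc_blocks = [
    {"t": "RawBlock", "c": ["html", "<details>"]},
    {"t": "Header", "c": [2, ["heading", [], []], [{"t": "Str", "c": "Heading"}]]},
    {"t": "Para", "c": [{"t": "Str", "c": "text"}]},
    {"t": "RawBlock", "c": ["html", "</details>"]},
]
result = wrap_details_contents(doc_blocks)
print([b["t"] for b in result])  # ['RawBlock', 'Div', 'RawBlock']
```

Unmatched `<details>` blocks are left alone rather than guessed at, so the worst case is the original behaviour.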

> For EPUB we need to split the document into sections, and the makeSections function just sees these raw HTML blocks as black boxes; that's why you get the invalid nesting. There's no good general solution to this problem.

Theoretically, could makeSections parse raw HTML blocks and take matching tags into account when splitting the document into sections? I'm guessing Pandoc probably doesn't want to do that, but would it be correct?

jgm commented

Theoretically yes, but it would add a lot of complexity and I don't think it's worth it, probably.

> Theoretically yes, but it would add a lot of complexity and I don't think it's worth it, probably.

Is there any other way we could fix rendering of documents like this, even if the output isn't perfect? I'm guessing stripping out all raw HTML blocks is too coarse-grained, since they're generally easily translatable to EPUB/XHTML. What about stripping only the HTML blocks that contain unterminated tags? That also involves parsing, but is a little simpler.
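
To illustrate what "contains unterminated tags" could mean in practice, here is a rough Python sketch using the standard library's `html.parser`. It is not anything pandoc does; the names `TagBalance` and `is_self_contained` are hypothetical. The check is deliberately strict: it also flags HTML that is legal but relies on implied end tags, such as an unclosed `<p>` or `<li>`.

```python
from html.parser import HTMLParser

# Void elements never take an end tag, so they must not be pushed on the stack.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class TagBalance(HTMLParser):
    """Track whether every start tag in a fragment is closed in order."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.balanced = True

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_ELEMENTS:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        pass  # self-closing tags such as <br/> open nothing

    def handle_endtag(self, tag):
        if tag in VOID_ELEMENTS:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            # A close with no matching open, e.g. a lone </details>.
            self.balanced = False


def is_self_contained(raw_html):
    """Return True if the fragment neither leaves tags open nor closes outer ones."""
    parser = TagBalance()
    parser.feed(raw_html)
    parser.close()
    return parser.balanced and not parser.stack


raw_blocks = ["<details>", "</details>", "<div><p>ok<br>fine</p></div>"]
kept = [html for html in raw_blocks if is_self_contained(html)]
print(kept)  # ['<div><p>ok<br>fine</p></div>']
```

Under this check both halves of the `<details>` pair from the report would be dropped while the self-contained `<div>` survives, so the content would still render, just without the disclosure widget.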

jgm commented

Stripping them is fine if you don't want them to appear in the EPUB, but people might very well include raw HTML that they do want to appear in the EPUB.

It seems that there is no real, viable solution – hence closing. Please reopen if I misunderstood.