jgm/djot.js

Rendering from HTML to Djot gives unexpected results

tbdalgaard opened this issue · 5 comments

When a document is converted from Djot to HTML and later back to Djot, there can be some interesting side effects. For example, explicit heading identifiers can break, because the HTML renderer creates ids for both the section tag and the heading tag.

Here is how to reproduce:

Save this file as Djot:


{#top}
# Djot test document

Welcome to this test document.

{#second}
Here is the second paragraph.

## Moving to links

We will now show that we can jump to the [top](#top) and to the [second paragraph](#second) with these two links.

The end

Now convert this document to HTML by piping Djot's output through Pandoc. Going through Pandoc is important because we want a standalone HTML version of the document. I ran:
djot -f djot hello.dj.txt -t pandoc | pandoc -f json -t html -s -o hello.html

Pandoc will give a warning since there is no title specified when we converted this.

Now try to convert the document from HTML back to Djot again by doing:

pandoc hello.html -f html -t json | djot -f pandoc -t djot > hello2.dj.txt

This produces the following document:

{#top}
{#djot-test-document}
# Djot test document

Welcome to this test document.

Here is the second paragraph.

{#Moving-to-links}
{#moving-to-links}
## Moving to links

We will now show that we can jump to the [top](#top) and to the [second
paragraph](#second) with these two links.

The end

The following changed during the conversion:

  1. The explicit heading identifier was not respected: another identifier was added, and that may break the heading. For example, the original identifier for the "Djot test document" heading was top, but djot-test-document was added after it. This would cause the top identifier to be ignored, and the # Djot test document heading would be shown as raw Djot instead of as an h1 heading. @jgm says that the HTML section tag may be present in the converted HTML document, and that is exactly right. How can this be fixed?
  2. The {#second} identifier, which is the target of the link to the second paragraph, was removed from the HTML document, so the link no longer works. Why was it removed?

If, on the other hand, I convert the original document to HTML with Djot alone, the link to second works, but then I do not get the standalone HTML version that Pandoc can produce.

jgm commented

Note that the conversion via pandoc to HTML is fine:

<section id="top">
<h1>Djot test document</h1>

The problem is that in converting this back from HTML, pandoc introduces an automatic identifier on the heading. (Pandoc AST output:)

[ Div
    ( "top" , [ "section" ] , [] )
    [ Header
        1
        ( "djot-test-document" , [] , [] )

I'm inclined to call this a bug in pandoc; pandoc should avoid adding these identifiers when the heading is the first element in a section that already has an id.

You can work around this though by using -f html-auto_identifiers with pandoc; this turns off the auto_identifiers extension.
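
As an illustration of the rule suggested above (no separate identifier on a heading that opens a section which already has an id), here is a minimal Python sketch that post-processes pandoc's JSON AST. It is not something pandoc or djot does. The function name `drop_redundant_heading_ids` is hypothetical, and the sketch assumes that any id on such a heading is redundant, since it cannot tell automatic identifiers from explicit ones. It only descends into nested Div blocks.

```python
def drop_redundant_heading_ids(blocks):
    """Clear the id of a Header that opens a section Div which already has an id.

    `blocks` is a list of pandoc JSON AST blocks. The list is modified in place
    and also returned. Assumption: the heading id is redundant in this position.
    """
    for block in blocks:
        if block.get("t") != "Div":
            continue
        # A Div is encoded as [[id, classes, key-value pairs], [child blocks]].
        (div_id, classes, _attrs), children = block["c"]
        opens_with_header = bool(children) and children[0].get("t") == "Header"
        if div_id and "section" in classes and opens_with_header:
            # A Header is encoded as [level, [id, classes, key-value pairs], inlines].
            children[0]["c"][1][0] = ""
        # Sections nest as Divs inside Divs, so recurse into the children.
        drop_redundant_heading_ids(children)
    return blocks


# The fragment of the AST quoted above, in pandoc's JSON encoding.
example_blocks = [
    {"t": "Div", "c": [
        ["top", ["section"], []],
        [{"t": "Header", "c": [
            1,
            ["djot-test-document", [], []],
            [{"t": "Str", "c": "Djot"}],
        ]}],
    ]},
]

cleaned = drop_redundant_heading_ids(example_blocks)
```

In practice such a function would be applied to the "blocks" field of the JSON that pandoc emits with -t json, before that JSON is passed on to djot -f pandoc.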

jgm commented

On second: the problem here is that pandoc's AST does not allow you to attach attributes directly to paragraphs, so it's just getting ignored. I suppose we could have djot's pandoc renderer create an enclosing div in this case.
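
To make the enclosing-div idea concrete, here is a small Python sketch of the AST shape it would produce. This is only an illustration of pandoc's JSON encoding, not djot's actual renderer (which is written in JavaScript). The helper name `wrap_in_div` is hypothetical.

```python
def wrap_in_div(block, identifier, classes=(), attributes=()):
    """Wrap a pandoc JSON block in a Div that carries the given attributes.

    A Para has no attribute slot in pandoc's AST, but a Div does, so the
    identifier can be attached to the wrapper instead.
    """
    attr = [identifier, list(classes), [list(pair) for pair in attributes]]
    return {"t": "Div", "c": [attr, [block]]}


# The paragraph that carried {#second} in the Djot source (shortened).
second_paragraph = {"t": "Para", "c": [
    {"t": "Str", "c": "Here"},
    {"t": "Space"},
    {"t": "Str", "c": "is"},
]}

wrapped = wrap_in_div(second_paragraph, "second")
```

With a wrapper like this, the second identifier would survive in the pandoc AST, so the [second paragraph](#second) link would again have a target in the HTML output. The trade-off is an extra div around the paragraph.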