Rendering from HTML to Djot gives unexpected results

Question

tbdalgaard opened this issue a year ago · 5 comments

When converting a document from Djot to HTML and then back to Djot, there can be some surprising side effects. For example, explicit heading identifiers break, since ids are generated for both the section tag and the heading tag.

Here is how to reproduce:

Save this file as Djot:

```
{#top}
# Djot test document

Welcome to this test document.

{#second}
Here is the second paragraph.

## Moving to links

We will now show that we can jump to the [top](#top) and to the [second paragraph](#second) with these two links.

The end
```

Now convert this document to HTML by piping Djot's output through Pandoc. This matters because we want a standalone HTML version of the document. I ran:
djot -f djot hello.dj.txt -t pandoc | pandoc -f json -t html -s -o hello.html

Pandoc gives a warning because no title was specified for the conversion.

Now convert the document from HTML back to Djot:

pandoc hello.html -f html -t json | djot -f pandoc -t djot > hello2.dj.txt

This produces the following document:

{#top}
{#djot-test-document}
# Djot test document

Welcome to this test document.

Here is the second paragraph.

{#Moving-to-links}
{#moving-to-links}
## Moving to links

We will now show that we can jump to the [top](#top) and to the [second
paragraph](#second) with these two links.

The end

The following changed during the conversion:

The explicit heading identifiers were not preserved on their own; a second identifier was added to each heading, and that may break the headings. Example: the original identifier for the "Djot test document" heading was top, but djot-test-document was added as well. As a result the top identifier is ignored, and # Djot test document is shown as raw Djot rather than as an h1 heading. @jgm says that the HTML section tag may be present in the converted HTML document, and that is exactly right. How can this be fixed?
The identifier for the second paragraph, which was defined by the {#second} attribute, was removed from the HTML document, so the link to it no longer works. Why was it removed?

If, on the other hand, I convert the original document to HTML with Djot directly, the link to second works, but then I do not get the standalone HTML version that Pandoc can produce.

Answer 1 · 2023-01-15T18:00:45.000Z

Note that the conversion via pandoc to HTML is fine:

<section id="top">
<h1>Djot test document</h1>

The problem is that in converting this back from HTML, pandoc introduces an automatic identifier on the heading. (Pandoc AST output:)

[ Div
    ( "top" , [ "section" ] , [] )
    [ Header
        1
        ( "djot-test-document" , [] , [] )

I'm inclined to call this a bug in pandoc; pandoc should avoid adding these identifiers when the heading is the first element in a section that already has an id.

You can work around this though by using -f html-auto_identifiers with pandoc; this turns off the auto_identifiers extension.
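
For anyone who would rather keep auto_identifiers on, the rule described above could also be applied as a post-processing step on pandoc's JSON output. The following is only a minimal Python sketch, not pandoc's or djot's own code. It assumes the layout pandoc emits with -t json (a Div's content is [attr, blocks], a Header's content is [level, attr, inlines], and attr is [id, classes, key-values]). The function name drop_redundant_heading_ids is made up for illustration. Note that it cannot tell an automatic id from an explicit one, so it clears any id on a heading that opens a section which already has its own id.

```python
def drop_redundant_heading_ids(blocks):
    """Clear the id of a Header that opens a section Div which already has an id.

    `blocks` is a list of block objects in pandoc's JSON AST form.
    The list is modified in place and also returned.
    """
    for block in blocks:
        if block.get("t") != "Div":
            continue
        # Div content is [attr, children]; attr is [id, classes, key-values].
        (div_id, classes, _keyvals), children = block["c"]
        if "section" in classes and div_id and children:
            first = children[0]
            if first.get("t") == "Header":
                # Header content is [level, attr, inlines]; blank out attr's id.
                first["c"][1][0] = ""
        # Sections can nest, so look inside this Div as well.
        drop_redundant_heading_ids(children)
    return blocks


# The fragment of the AST shown above, written as pandoc JSON.
example = [
    {
        "t": "Div",
        "c": [
            ["top", ["section"], []],
            [
                {
                    "t": "Header",
                    "c": [1, ["djot-test-document", [], []], [{"t": "Str", "c": "Djot"}]],
                }
            ],
        ],
    }
]
drop_redundant_heading_ids(example)
```

In a pipeline this would run on the "blocks" list of the JSON document, between pandoc -t json and djot -f pandoc.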

Answer 2 · 2023-01-15T18:10:00.000Z

Ah, OK, that makes sense. What about the second attribute being gone from the HTML that I render via Pandoc? Could that be a bug too?

Answer 3 · 2023-01-15T19:07:57.000Z

On second: the problem here is that pandoc's AST does not allow you to attach attributes directly to paragraphs, so it's just getting ignored. I suppose we could have djot's pandoc renderer create an enclosing div in this case.
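
To make the enclosing-div idea concrete, here is a rough Python sketch of the AST shape such a renderer could emit, written as pandoc JSON. This is not djot's actual renderer code: wrap_para_with_id is a hypothetical helper, and the only thing assumed about pandoc is its JSON layout, where a Para holds a list of inlines and a Div holds [attr, blocks] with attr being [id, classes, key-values].

```python
def wrap_para_with_id(inlines, identifier):
    """Return a Div carrying `identifier` that encloses a single Para.

    A Para has no attribute slot in pandoc's AST, so the id is attached
    to the enclosing Div instead.
    """
    para = {"t": "Para", "c": inlines}
    attr = [identifier, [], []]  # [id, classes, key-value pairs]
    return {"t": "Div", "c": [attr, [para]]}


# The {#second} paragraph from the test document, shortened to two words.
second = wrap_para_with_id(
    [{"t": "Str", "c": "Here"}, {"t": "Space"}, {"t": "Str", "c": "is"}],
    "second",
)
```

With a Div like this in the AST, the id has somewhere to live, so the [second paragraph](#second) link would have a target again in the HTML output.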

Answer 4 · 2023-01-15T19:34:06.000Z

That would be OK. I wonder how we can avoid conflicts between Djot and Pandoc in the future? Not to offend anyone, but I really see Djot as a format for non-programmers too, and therefore I got a little worried when a document as simple as mine changed that much when I had to use Pandoc to get Djot syntax from HTML. I wonder if Djot could take over more of the conversion, so that Pandoc would convert based on what Djot can do, and we could perhaps avoid conflicts between the different ASTs in Djot and Pandoc.

Answer 5 · 2023-01-15T19:37:20.000Z

Ah, OK, that sounds like a bug in Pandoc to me as well. I have seen Pandoc get this right when using Markdown as the source. Would you like me to open an issue over there about this?