jgm/pandoc

Docx -> HTML: Pandoc discards comments on tables, table rows and table cells

bvobart opened this issue · 0 comments

Explain the problem.
I'm working on a project where we convert Docx files to accessible HTML. We do some of our own pre- and postprocessing, but the main conversion from Docx to HTML is done by Pandoc. To track how each element of a Docx file is transformed through preprocessing, the Pandoc conversion and postprocessing, we place a UUID comment on each element that we want to track. A simplified example for paragraphs (w:p):

<w:p>
  <w:commentRangeStart w:id="0"/>
  ...
  <w:r>
    <w:t>Some Text</w:t>
  </w:r>
  ...
  <w:commentRangeEnd w:id="0"/>
</w:p>

The comments.xml of that document will then contain a w:comment with ID 0 and a UUID as contents.
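
To check which elements the comments are anchored to before running Pandoc, a minimal Python sketch along these lines could be used. It maps each comment ID to the element that directly contains its w:commentRangeStart. The function names are hypothetical and not part of Pandoc; the sketch assumes the main document part is stored at word/document.xml inside the Docx archive, which is the usual location.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace that the "w:" prefix is bound to in document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def load_document_xml(docx_path):
    """Read the main document part of a .docx file as bytes.

    Assumption: the part lives at word/document.xml.
    """
    with zipfile.ZipFile(docx_path) as archive:
        return archive.read("word/document.xml")


def comment_anchor_parents(document_xml):
    """Map each comment ID to the local name of the element that
    directly contains its w:commentRangeStart (e.g. "tbl", "tr", "tc", "p").
    """
    root = ET.fromstring(document_xml)
    anchors = {}
    for parent in root.iter():
        for child in parent:
            if child.tag == f"{{{W_NS}}}commentRangeStart":
                comment_id = child.get(f"{{{W_NS}}}id")
                # Strip the "{namespace}" prefix to get the local name.
                anchors[comment_id] = parent.tag.split("}")[-1]
    return anchors
```

Calling comment_anchor_parents(load_document_xml("annotator_tables_test_gh_issue.docx")) would then list, per comment ID, whether the comment starts in a w:tbl, w:tr, w:tc or w:p.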

Now we want to track what happens to tables, so I tried adding comments to w:tbl, w:tr and w:tc elements in the same way as in the w:p example above. However, these comments never make it into the resulting HTML. Only the comments placed on the w:p elements inside the table cells end up on the HTML td elements. The comments on w:tr elements do not end up on the HTML tr or thead elements, and the comments on w:tbl elements do not end up on the HTML table element. In both of those cases, the comments are simply discarded.

This is the command I'm using to call Pandoc:

pandoc --from docx --to html --output result.html --track-changes=all annotator_tables_test_gh_issue.docx

See here for an example file containing several tables, where each w:tbl, w:tr, w:tc and w:p has been annotated with a UUID comment (note: they're only visible in the XML, not when you open the file with Word):
annotator_tables_test_gh_issue.docx

My expectation is that the w:tbl comments end up on the HTML table element, the w:tr comments end up on the HTML tr element, and the w:tc or w:p comments end up on the HTML td element.
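
One way to see which comments were dropped is to compare the comment IDs used in the Docx with the IDs that appear in the generated HTML. The sketch below is only an illustration: it assumes that, with --track-changes=all, Pandoc writes each surviving comment as a span with class comment-start whose id attribute is the comment ID. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class CommentStartCollector(HTMLParser):
    """Collect the id attributes of <span class="comment-start"> elements.

    Assumption: this is how comments appear in Pandoc's HTML output
    when --track-changes=all is used.
    """

    def __init__(self):
        super().__init__()
        self.comment_ids = set()

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag == "span" and "comment-start" in classes:
            comment_id = attr_map.get("id")
            if comment_id is not None:
                self.comment_ids.add(comment_id)


def discarded_comment_ids(expected_ids, html_text):
    """Return the expected comment IDs that do not occur in the HTML."""
    collector = CommentStartCollector()
    collector.feed(html_text)
    collector.close()
    return sorted(set(expected_ids) - collector.comment_ids)
```

Here expected_ids would be the comment IDs placed in the Docx, and html_text the contents of result.html. With the behaviour described above, the IDs belonging to w:tbl and w:tr comments would be reported as discarded.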

Pandoc version?
3.1.13