Docx -> HTML: Pandoc discards comments on tables, table rows and table cells
bvobart opened this issue · 0 comments
Explain the problem.
I'm working on a project where we convert Docx files to accessible HTML. We do some of our own pre- and postprocessing, but the main conversion from Docx to HTML is done by Pandoc. To track how each element of a Docx file is transformed through preprocessing, Pandoc and postprocessing, we place UUID comments on each element that we want to track. Here is a simplified example for paragraphs (w:p):
<w:p>
  <w:commentRangeStart w:id="0"/>
  ...
  <w:r>
    <w:t>Some Text</w:t>
  </w:r>
  ...
  <w:commentRangeEnd w:id="0"/>
</w:p>
The comments.xml of that document will then contain a w:comment with ID 0 and a UUID as contents.
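To make the ID-to-UUID lookup concrete, here is a minimal Python sketch that reads the contents of a comments.xml and maps each w:comment ID to its text. The sample XML, the author name and the UUID in it are hypothetical placeholders, not taken from the attached file, and the function name is only illustrative.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, used for all w:* elements and attributes.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"


def comment_texts(comments_xml: str) -> dict:
    """Map each w:comment's w:id to the concatenated text of its w:t elements."""
    root = ET.fromstring(comments_xml)
    result = {}
    for comment in root.iter(W + "comment"):
        comment_id = comment.get(W + "id")
        text = "".join(t.text or "" for t in comment.iter(W + "t"))
        result[comment_id] = text
    return result


# Hypothetical sample: one comment with ID 0 whose content is a placeholder UUID.
SAMPLE_COMMENTS_XML = """\
<w:comments xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:comment w:id="0" w:author="annotator">
    <w:p><w:r><w:t>123e4567-e89b-12d3-a456-426614174000</w:t></w:r></w:p>
  </w:comment>
</w:comments>
"""

print(comment_texts(SAMPLE_COMMENTS_XML))
```

In a real Docx file this part usually lives at word/comments.xml inside the zip archive, so its contents could be read with the standard zipfile module before being passed to the function.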
Now we want to track what happens to tables, so I tried adding comments to w:tbl, w:tr and w:tc elements in the same way as in the w:p example above. However, these comments never make it into the resulting HTML. Only the comments placed on the w:p elements inside the table cells end up on the HTML td elements. The comments on the w:tr elements do not end up on the HTML tr or thead elements, and the comments on the w:tbl elements do not end up on the HTML table element. In both cases, the comments are simply discarded.
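For anyone reproducing this, the following Python sketch shows one way to check which kind of element each comment anchor sits in. It assumes the w:commentRangeStart is a direct child of the annotated element, as in the paragraph example; the sample document XML is hand-written for illustration and is not the attached file.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, used for all w:* elements and attributes.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"


def anchor_parents(document_xml: str) -> dict:
    """Map each comment ID to the local name of the element that directly
    contains its w:commentRangeStart (e.g. 'tbl', 'tr', 'tc' or 'p')."""
    root = ET.fromstring(document_xml)
    parents = {}
    for parent in root.iter():
        for child in parent:
            if child.tag == W + "commentRangeStart":
                # Strip the '{namespace}' prefix to get the local element name.
                parents[child.get(W + "id")] = parent.tag.split("}", 1)[-1]
    return parents


# Hand-written sample: one table with an anchor on w:tbl, w:tr, w:tc and w:p.
SAMPLE_DOCUMENT_XML = """\
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:body>
    <w:tbl>
      <w:commentRangeStart w:id="0"/>
      <w:tr>
        <w:commentRangeStart w:id="1"/>
        <w:tc>
          <w:commentRangeStart w:id="2"/>
          <w:p>
            <w:commentRangeStart w:id="3"/>
            <w:r><w:t>Cell</w:t></w:r>
            <w:commentRangeEnd w:id="3"/>
          </w:p>
          <w:commentRangeEnd w:id="2"/>
        </w:tc>
        <w:commentRangeEnd w:id="1"/>
      </w:tr>
      <w:commentRangeEnd w:id="0"/>
    </w:tbl>
  </w:body>
</w:document>
"""

print(anchor_parents(SAMPLE_DOCUMENT_XML))
```

With a mapping like this, the comment IDs anchored in tbl, tr and tc are the ones to look for in the HTML output.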
This is the command I'm using to call Pandoc:
pandoc --from docx --to html --output result.html --track-changes=all annotator_tables_test_gh_issue.docx
Here is an example file containing several tables, where each w:tbl, w:tr, w:tc and w:p has been annotated with a UUID comment (note: the comments are only visible in the XML, not when you open the file in Word):
annotator_tables_test_gh_issue.docx
My expectation is that the w:tbl comments end up on the HTML table element, the w:tr comments end up on the HTML tr element, and the w:tc or w:p comments end up on the HTML td element.
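The expectation can be turned into a rough check: every UUID that was placed in the Docx should occur somewhere in the HTML. The Python sketch below only searches for the UUID strings, so it does not depend on how Pandoc marks up comments. The UUID placeholders and the HTML string are hand-written stand-ins that mimic the behaviour described above; they are not real Pandoc output.

```python
def missing_anchors(expected: dict, html: str) -> list:
    """Return (element kind, uuid) pairs whose UUID does not occur in the HTML.

    `expected` maps each UUID to the kind of Docx element it was placed on.
    """
    return [(kind, uuid) for uuid, kind in expected.items() if uuid not in html]


# Hypothetical placeholders standing in for real UUIDs.
EXPECTED = {
    "uuid-for-tbl": "tbl",
    "uuid-for-tr": "tr",
    "uuid-for-tc": "tc",
    "uuid-for-p": "p",
}

# Hand-written stand-in for the converted HTML: only the paragraph's UUID survives.
SAMPLE_HTML = (
    "<table><tbody><tr><td>"
    '<span class="comment-start" id="3">uuid-for-p</span>Cell'
    '<span class="comment-end" id="3"></span>'
    "</td></tr></tbody></table>"
)

for kind, uuid in missing_anchors(EXPECTED, SAMPLE_HTML):
    print(f"comment on w:{kind} was discarded: {uuid}")
```

Running the same check against the actual result.html would list exactly the comments that were lost in conversion.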
Pandoc version?
3.1.13