Broken superscripts (references to bibliography items, footnotes, author affiliations, etc.)
XZF0 opened this issue · 1 comments
Superscript elements do not seem to be handled in any particular way: they are simply fused with the word they follow. This is a bug, since it produces invalid words and loses the information carried by the superscript. It probably affects the majority of scientific paper PDFs. Mangled author names in particular look rather disrespectful :)
- Separating superscript text from the preceding word with whitespace would already be a substantial improvement.
- A configurable representation for superscripts would be even better (escaped square brackets might be a reasonable default, or a `<sup/>` tag, which many Markdown viewers support).
(Handling the semantics of references etc. is probably out of scope for a document parser; the downstream logic should be able to do that, given a reasonable representation.)
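
To make the proposal concrete, here is a minimal sketch of what a configurable representation could look like. It is not based on the parser's actual code: the `(text, is_superscript)` span model, the `render_spans` function and the style names `space`, `brackets` and `sup` are all hypothetical, and detecting which spans are superscripts (e.g. from font size and baseline offset) is assumed to happen elsewhere.

```python
from typing import Iterable, Tuple

# Hypothetical span model: (text, is_superscript). How the flag is derived
# from the PDF (font size, baseline offset, ...) is outside this sketch.
Span = Tuple[str, bool]

# Hypothetical style names, each mapped to a format template.
SUPERSCRIPT_STYLES = {
    "space": " {}",          # minimal fix: separate from the preceding word
    "brackets": "\\[{}\\]",  # escaped square brackets, as suggested above
    "sup": "<sup>{}</sup>",  # HTML tag rendered by many Markdown viewers
}


def render_spans(spans: Iterable[Span], style: str = "brackets") -> str:
    """Join spans into one string, wrapping superscripts per the chosen style."""
    template = SUPERSCRIPT_STYLES[style]
    return "".join(
        template.format(text) if is_sup else text
        for text, is_sup in spans
    )


# Example: an author line that currently comes out as "Jane Doe1,2 and John Smith3".
spans = [("Jane Doe", False), ("1,2", True), (" and John Smith", False), ("3", True)]
print(render_spans(spans, "space"))     # Jane Doe 1,2 and John Smith 3
print(render_spans(spans, "brackets"))  # Jane Doe\[1,2\] and John Smith\[3\]
print(render_spans(spans, "sup"))       # Jane Doe<sup>1,2</sup> and John Smith<sup>3</sup>
```

All three styles keep the superscript content recoverable by downstream logic, which is the main point; which one should be the default is a separate question.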