jgm/pandoc

HTML>LaTeX drops the word after a tilde ~ but HTML>md>LaTeX doesn't

florianm opened this issue · 2 comments

Tested only on Ubuntu 12.04 with pandoc 1.9.1.1 (compiled with citeproc-hs 0.3.4, texmath 0.6.0.3, highlighting-kate 0.5.0.5)

TL;DR: Converting HTML to LaTeX directly drops any alphanumeric word that immediately follows a tilde (no whitespace in between).
Converting the same HTML document first to markdown, then to LaTeX, preserves the words following a tilde.

Example: test.html

<html><body>
<h1>First chapter</h1>
<p>The word after a tilde ~ will be missing. Example: ~can't ~touch ~this.</p>
<p>One little tilde sat on a wall. ~ Two little tildes had a bad fall. ~~ Three little tildes just wanted a hug. ~~~ Four little tildes show it's a bug. ~~~~</p>
</body></html>

Converting to markdown

$ pandoc test.html -o fromhtml.md

creates fromhtml.md:

First chapter
=============

The word after a tilde \~ will be missing. Example: \~can't \~touch
\~this.

One little tilde sat on a wall. \~ Two little tildes had a bad fall.
\~\~ Three little tildes just wanted a hug. \~\~\~ Four little tildes
show it's a bug. \~\~\~\~

Converting that to LaTeX

$ pandoc fromhtml.md -o frommd.tex

creates frommd.tex:

\section{First chapter}

The word after a tilde \ensuremath{\sim} will be missing. Example:
\ensuremath{\sim}can't \ensuremath{\sim}touch \ensuremath{\sim}this.

One little tilde sat on a wall. \ensuremath{\sim} Two little tildes had
a bad fall. \ensuremath{\sim}\ensuremath{\sim} Three little tildes just
wanted a hug. \ensuremath{\sim}\ensuremath{\sim}\ensuremath{\sim} Four
little tildes show it's a bug.
\ensuremath{\sim}\ensuremath{\sim}\ensuremath{\sim}\ensuremath{\sim}

Note that the words following each tilde, as well as consecutive tildes, are preserved.

Converting the original HTML directly to LaTeX, however, drops the words following a tilde and collapses consecutive tildes into one:

$ pandoc test.html -o fromhtml.tex

The resulting LaTeX file fromhtml.tex:

\section{First chapter}

The word after a tilde \ensuremath{\sim} will be missing. Example:
\ensuremath{\sim}'t \ensuremath{\sim} \ensuremath{\sim}.

One little tilde sat on a wall. \ensuremath{\sim} Two little tildes had
a bad fall. \ensuremath{\sim} Three little tildes just wanted a hug.
\ensuremath{\sim} Four little tildes show it's a bug. \ensuremath{\sim}
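
The difference between the two outputs can also be checked mechanically rather than by eye. Below is a minimal Python sketch, not part of pandoc: the helper name `tilde_report` is made up for illustration, and it assumes a literal tilde is written as `\ensuremath{\sim}` in the LaTeX, as in both listings above. It counts the tildes in a string and lists the tokens glued directly to a tilde, using excerpts copied from test.html, frommd.tex and fromhtml.tex.

```python
import re

# Assumption: a literal tilde shows up as this macro, as in both listings above.
SIM = r"\ensuremath{\sim}"


def tilde_report(text, tilde):
    """Return (total tilde count, non-empty tokens glued to the end of a tilde run).

    Hypothetical helper for this report, not a pandoc API.
    """
    # One or more consecutive tildes, followed by whatever non-space text is attached.
    run = "(?:" + re.escape(tilde) + ")+"
    glued = [tok for tok in re.findall(run + r"(\S*)", text) if tok]
    return text.count(tilde), glued


# Excerpts copied from test.html, frommd.tex and fromhtml.tex.
source = "Example: ~can't ~touch ~this."
via_md = r"\ensuremath{\sim}can't \ensuremath{\sim}touch \ensuremath{\sim}this."
direct = r"\ensuremath{\sim}'t \ensuremath{\sim} \ensuremath{\sim}."

for name, text, tilde in [
    ("html", source, "~"),
    ("html>md>latex", via_md, SIM),
    ("html>latex", direct, SIM),
]:
    print(name, tilde_report(text, tilde))
```

On these excerpts all three strings contain three tildes, but only the HTML source and the markdown route still have can't, touch and this. attached to them; the direct route keeps just 't and the final period.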

Why does pandoc produce two different results here, given that (as I understand it) it converts any input format into its own markdown dialect and then into the specified output format?
I am aware that tildes have a special meaning in markdown, but since the tildes in my example come from HTML, they seem not to be escaped properly.

jgm commented

This was a bug in pandoc 1.9.1.1. But we are now on 1.11.1, which works fine on your input!

Thanks for the fast answer, John!