jgm/pandoc

Markdown: Awkward soft break after abbreviation between ( and newline

fiapps opened this issue · 3 comments

Test case:

echo '(cf.
Foo)' | pandoc -f markdown -t markdown

Output: ( cf. Foo).

A space has been added after the open parenthesis. More precisely, if the native output format is chosen, we see it's a SoftBreak: [Para [Str "(",SoftBreak,Str "cf.\160Foo)"]].

This is a sufficiently rare case that it occurred only once in a 350-page document.

This is actually a pretty common bug if you use hard line wrapping in your source document. It occurs any time a line in the source ends in an abbreviation that is immediately preceded by an opening parenthesis:

Lorem (e.g.
ipsum)

produces output

Lorem ( e.g. ipsum)

I hard wrap at 78 characters in my source documents. For me, this produces about 3 errors per 8,000 words on average, and of course it affects all output formats.

mb21 commented

A possible workaround is to use --abbreviations=/dev/null (or another empty file), so that no abbreviations are recognized.

jgm commented

Here's the relevant code (in str in the Markdown reader):

      abbrevs <- getOption readerAbbreviations
      if not (null result) && last result == '.' && result `Set.member` abbrevs
         then try (do ils <- whitespace <|> endline
                      lookAhead alphaNum
                      return $ do
                        ils' <- ils
                        if ils' == B.space
                           then return (B.str result <> B.str "\160")
                           else -- linebreak or softbreak
                                return (ils' <> B.str result <> B.str "\160"))
                <|> return (return (B.str result))
         else return (return (B.str result)))

The logic is this: when an abbreviation is followed by a space, we replace that space with a nonbreaking space. When it is followed by a line break (soft or hard), we replace the break with a nonbreaking space and move the line break before the abbreviation. That gives bad results when the abbreviation isn't itself preceded by a space: in "(cf.", the moved break lands between the "(" and the abbreviation, and is then rendered as a space.
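To make the effect easier to see, here is a minimal sketch that models the rule on a simplified token list. It is not the Markdown reader's code: the `Inline` type, the three-entry `abbrevs` set, and the `rewrite` and `isAbbrevBefore` functions are all stand-ins invented for illustration, and the input is assumed to be already tokenized with "(" as its own `Str`.

```haskell
import Data.Char (isAlphaNum)
import qualified Data.Set as Set

-- Simplified stand-in for pandoc's inline elements (assumption, not the real type).
data Inline = Str String | Space | SoftBreak
  deriving (Eq, Show)

-- Hypothetical, tiny abbreviation set; the real default list is much longer.
abbrevs :: Set.Set String
abbrevs = Set.fromList ["cf.", "e.g.", "i.e."]

-- True when s is a known abbreviation and the next token starts with an
-- alphanumeric character (mirrors the `lookAhead alphaNum` check).
isAbbrevBefore :: String -> [Inline] -> Bool
isAbbrevBefore s (Str (c : _) : _) = s `Set.member` abbrevs && isAlphaNum c
isAbbrevBefore _ _ = False

-- Model of the rule described above:
--   abbreviation + space      -> abbreviation + nonbreaking space
--   abbreviation + soft break -> soft break + abbreviation + nonbreaking space
rewrite :: [Inline] -> [Inline]
rewrite (Str s : Space : rest)
  | isAbbrevBefore s rest = Str s : Str "\160" : rewrite rest
rewrite (Str s : SoftBreak : rest)
  | isAbbrevBefore s rest = SoftBreak : Str s : Str "\160" : rewrite rest
rewrite (x : xs) = x : rewrite xs
rewrite [] = []

main :: IO ()
main = do
  -- "(cf.\nFoo)": the break is hoisted to sit right after "(".
  print (rewrite [Str "(", Str "cf.", SoftBreak, Str "Foo)"])
  -- "Lorem (e.g.\nipsum)": same problem.
  print (rewrite [Str "Lorem", Space, Str "(", Str "e.g.", SoftBreak, Str "ipsum)"])
```

In both cases the `SoftBreak` ends up directly after `Str "("`, which matches the native output in the original report once adjacent `Str` elements are merged. A writer that renders a soft break as a space then produces "( cf. Foo)".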