jgm/pandocfilters

Malformed text due to stdin incorrectly decoded (windows and py3)

sergiocorreia opened this issue · 2 comments

Problem

toJSONFilter calls sys.stdin.read(), which on Python 3 on Windows decodes the input incorrectly, because Python makes "an educated guess as to what encoding is used" (http://stackoverflow.com/a/16549381/3977107).

Also see the Python documentation for sys.stdin (https://docs.python.org/2/library/sys.html#sys.stdin):

The character encoding is platform-dependent. Under Windows, if the stream is interactive (that is, if its isatty() method returns True), the console codepage is used, otherwise the ANSI code page. Under other platforms, the locale encoding is used
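To make the failure mode concrete, here is a minimal sketch of what likely happens to the bytes. It assumes cp1252 as the ANSI code page (the actual code page depends on the Windows locale), and the function name decode_both_ways is only illustrative:

```python
def decode_both_ways(raw):
    """Decode the same bytes as cp1252 (assumed ANSI code page) and as UTF-8."""
    return raw.decode("cp1252"), raw.decode("utf-8")

# U+00A0 (non-breaking space) is encoded in UTF-8 as the two bytes C2 A0.
raw = "here\u00a0and".encode("utf-8")
wrong, right = decode_both_ways(raw)

print(repr(wrong))  # an extra character appears in front of the non-breaking space
print(repr(right))  # the non-breaking space survives as a single character
```

Under cp1252 the byte C2 is read as a separate character, so the text grows by one character for every non-breaking space.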

Example

In particular, this markdown code ends up malformed when going through a filter:

---
title: The Title
...

# One Section

Non-breaking space\ here\ and\ here.

Lorem ipsum dolor sit amet.

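The cause is the non-breaking spaces: Pandoc correctly encodes each one (U+00A0, http://www.fileformat.info/info/unicode/char/00a0/index.htm) as the UTF-8 bytes C2 A0, and the Python filter then decodes those bytes incorrectly. Running this command reproduces it: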

pandoc bug.md --filter=path/to/filter/abc.py --to=json

You can use abc.py or even a trivial filter that does nothing:

#!/usr/bin/env python

from pandocfilters import toJSONFilter

def relax(key, value, format, meta):
    # Returning None leaves every element unchanged.
    return

if __name__ == "__main__":
    toJSONFilter(relax)

Solution

I followed the suggestions in the Stack Overflow thread (http://stackoverflow.com/questions/16549332/python-3-how-to-specify-stdin-encoding) and replaced this line:

doc = json.loads(sys.stdin.read())

with these lines:

input_stream = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8')
doc = json.loads(input_stream.read())

(Plus an import io at the top of the module.) This seemed to fix the problem.
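For anyone who wants to check the approach without running Pandoc, here is a small sketch of the same idea. It uses io.BytesIO as a stand-in for sys.stdin.buffer (which only exists on Python 3), and the helper name load_json_utf8 is made up for illustration, not part of pandocfilters:

```python
import io
import json

def load_json_utf8(binary_stream):
    """Decode a binary stream as UTF-8, ignoring the platform default, and parse JSON."""
    text_stream = io.TextIOWrapper(binary_stream, encoding="utf-8")
    return json.loads(text_stream.read())

# Stand-in for the bytes Pandoc would write to the filter's stdin.
payload = json.dumps(["here\u00a0and"], ensure_ascii=False).encode("utf-8")

# In a real filter, sys.stdin.buffer would be passed instead of the BytesIO object.
doc = load_json_utf8(io.BytesIO(payload))
print(doc == ["here\u00a0and"])
```

Because the encoding is given explicitly, the result should not depend on the console or ANSI code page.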

If it's easier for you, I can fork it and submit a proper pull request.

Best,
Sergio

Computer Details

  • Windows 7
  • Python 3.4.3
  • Pandoc 1.15 (Compiled with texmath 0.8.2, highlighting-kate 0.6.)

jgm commented

Yes, thanks, a PR would be welcome.

+++ Sergio Correia [Jul 09 15 23:45 ]:

Done!