Malformed text due to stdin incorrectly decoded (windows and py3)
sergiocorreia opened this issue · 2 comments
Problem
toJSONFilter runs sys.stdin.read(), which on Python 3 (and Windows) causes problems due to Python making "an educated guess as to what encoding is used".
Also see the Python documentation:
The character encoding is platform-dependent. Under Windows, if the stream is interactive (that is, if its isatty() method returns True), the console codepage is used, otherwise the ANSI code page. Under other platforms, the locale encoding is used
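To see which encoding Python would pick on a given machine, something like the following sketch may help. The helper name describe_stdin_encoding is made up for illustration and is not part of pandocfilters; it only reports what the standard library exposes.

```python
import locale
import sys


def describe_stdin_encoding(stream=None):
    """Report how Python would decode the given text stream (default: sys.stdin).

    Hypothetical helper for illustration only; not part of pandocfilters.
    """
    if stream is None:
        stream = sys.stdin
    return {
        # Encoding Python attached to the stream when it was created.
        "stream_encoding": getattr(stream, "encoding", None),
        # Locale default (the ANSI code page on Windows).
        "preferred_encoding": locale.getpreferredencoding(False),
        # True when reading from a console rather than a pipe or file.
        "is_interactive": stream.isatty(),
    }


if __name__ == "__main__":
    for name, value in describe_stdin_encoding().items():
        print(name, "=", value)
```

When Pandoc runs a filter, stdin is a pipe, so is_interactive should be False and, per the documentation quoted above, the ANSI code page would apply on Windows.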
Example
In particular, this markdown code ends up malformed when going through a filter:
---
title: The Title
...
# One Section
Non-breaking space\ here\ and\ here.
Lorem ipsum dolor sit amet.
The cause is the non-breaking spaces, which Pandoc correctly encodes as the UTF-8 bytes C2 A0; the Python filter then mangles them when run with this command:
pandoc bug.md --filter=path/to/filter/abc.py --to=json
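The mangling can be reproduced without Pandoc. This is only a sketch, and it assumes the ANSI code page is cp1252 (common on Western European Windows installs, but not confirmed for the machine in this report):

```python
NBSP = "\u00a0"                      # non-breaking space
utf8_bytes = NBSP.encode("utf-8")    # the two bytes C2 A0 that Pandoc emits

# Decoding with the right codec gives back a single character.
right = utf8_bytes.decode("utf-8")

# Decoding with cp1252 (assumed ANSI code page) treats each byte as its
# own character: U+00C2 (A with circumflex) followed by U+00A0.
wrong = utf8_bytes.decode("cp1252")

print(len(right), len(wrong))  # prints: 1 2
```

The extra U+00C2 character in front of each non-breaking space is what shows up as malformed text in the filter output.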
You can use abc.py or even a trivial filter that does nothing:
#!/usr/bin/env python
import os
from pandocfilters import toJSONFilter

def relax(key, value, format, meta):
    return

if __name__ == "__main__":
    toJSONFilter(relax)
Solution
I followed the suggestions in the Stack Overflow thread and replaced this line:
doc = json.loads(sys.stdin.read())
by these lines:
input_stream = io.TextIOWrapper(sys.stdin.buffer, encoding='utf-8')
doc = json.loads(input_stream.read())
(Plus import io.) This seemed to fix the problem.
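For reference, here is the same idea as a small standalone function, so it can be tried without a real stdin. The name load_json_utf8 is hypothetical and not something pandocfilters provides; this is a sketch of the approach, not the final patch.

```python
import io
import json


def load_json_utf8(binary_stream):
    """Decode a binary stream as UTF-8, ignoring the platform default,
    and parse the result as JSON.

    Hypothetical helper; in a filter on Python 3 it would be called as
    load_json_utf8(sys.stdin.buffer).
    """
    text_stream = io.TextIOWrapper(binary_stream, encoding="utf-8")
    return json.loads(text_stream.read())


if __name__ == "__main__":
    # Simulate Pandoc's output: a JSON string containing bytes C2 A0.
    fake_stdin = io.BytesIO(b'{"c": "here\xc2\xa0and"}')
    doc = load_json_utf8(fake_stdin)
    print(len(doc["c"]))  # prints: 8
```

Note that sys.stdin.buffer only exists on Python 3, so code that must also run on Python 2 would need a separate branch.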
If it's easier for you, I can fork it and submit a proper pull request.
Best,
Sergio
Computer Details
- Windows 7
- Python 3.4.3
- Pandoc 1.15 (Compiled with texmath 0.8.2, highlighting-kate 0.6.)
Yes, thanks, a PR would be welcome.
+++ Sergio Correia [Jul 09 15 23:45 ]:
Done!