amyreese/markdown-pp

Choose different output encoding in Windows

It seems that the default encoding on Windows is cp1252, which causes problems when the source files contain Unicode characters that have no mapping in that code page.

When I try to process a document containing the character '●' with MarkdownPP on Windows, it exits with the following error:

Traceback (most recent call last):
  File "C:\Users\frank\AppData\Local\Programs\Python\Python37\Scripts\markdown-pp-script.py", line 11, in <module>
    load_entry_point('MarkdownPP==1.4', 'console_scripts', 'markdown-pp')()
  File "C:\Users\frank\AppData\Local\Programs\Python\Python37\lib\site-packages\MarkdownPP\main.py", line 112, in main
    MarkdownPP.MarkdownPP(input=mdpp, output=md, modules=modules)
  File "C:\Users\frank\AppData\Local\Programs\Python\Python37\lib\site-packages\MarkdownPP\MarkdownPP.py", line 28, in __init__
    pp.process()
  File "C:\Users\frank\AppData\Local\Programs\Python\Python37\lib\site-packages\MarkdownPP\Processor.py", line 49, in process
    transforms = module.transform(self.data)
  File "C:\Users\frank\AppData\Local\Programs\Python\Python37\lib\site-packages\MarkdownPP\Modules\Include.py", line 39, in transform
    includedata = self.include(match)
  File "C:\Users\frank\AppData\Local\Programs\Python\Python37\lib\site-packages\MarkdownPP\Modules\Include.py", line 70, in include
    data[linenum:linenum+1] = self.include(match, dirname)
  File "C:\Users\frank\AppData\Local\Programs\Python\Python37\lib\site-packages\MarkdownPP\Modules\Include.py", line 61, in include
    data = f.readlines()
  File "C:\Users\frank\AppData\Local\Programs\Python\Python37\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 6569: character maps to <undefined>

The same input is processed without problems on Linux.
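
For anyone who wants to reproduce this without the original files, here is a minimal sketch of what I think is going on. It assumes the included file is stored as UTF-8; the helper name `try_decode` is made up for illustration and is not part of MarkdownPP. The UTF-8 encoding of '●' ends in byte 0x8f, which cp1252 leaves undefined, and that matches the byte reported in the traceback.

```python
# '●' is U+25CF; its UTF-8 encoding is the three bytes E2 97 8F.
bullet_bytes = u"\u25cf".encode("utf-8")


def try_decode(raw, encoding):
    """Hypothetical helper: decode raw bytes, or report why decoding failed."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return "failed: {}".format(exc.reason)


# Decoding as UTF-8 yields a single character.
print(len(try_decode(bullet_bytes, "utf-8")))

# Decoding as cp1252 fails because byte 0x8f has no mapping in that code page.
print(try_decode(bullet_bytes, "cp1252"))
```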

Seems related to #53. I'm not familiar with how Python handles encodings on Windows, but on Linux, it uses the default encoding specified by the OS/environment.

Yes, that seems to be the same issue. The Python 3 interpreter uses the encoding returned by locale.getpreferredencoding(), which on my Windows system is cp1252 and on my Linux system is UTF-8. My .mdpp files are encoded using UTF-8 and contain non-ASCII characters, so Python on Windows can't read them.
The fix described in #53 would work, but only for Python 3.
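
One way to avoid depending on the locale on both Python 2 and Python 3 might be `io.open`, which accepts an `encoding` argument on both. This is only a sketch and not the code from #53 or from MarkdownPP itself; `read_lines` is a hypothetical name, and the UTF-8 default is an assumption based on how my files are stored.

```python
import io
import locale


def read_lines(path, encoding="utf-8"):
    """Hypothetical helper: read a text file with an explicit encoding.

    io.open takes an encoding argument on Python 2.6+ and on Python 3,
    so the result does not depend on the platform's preferred encoding.
    """
    with io.open(path, "r", encoding=encoding) as f:
        return f.readlines()


# The locale-dependent encoding that a plain open() falls back to on Python 3.
print(locale.getpreferredencoding())
```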

I might have a fix for this here:
https://github.com/VincenzoLaSpesa/markdown-pp

I exposed the encoding parameter on the MarkdownPP class, and now I can call it with:

MarkdownPP(input=infile, modules=['include', 'toc'], output=outfile, encoding="UTF8")

If no encoding is provided, it defaults to sys.getdefaultencoding().

I will test it a little more and then I will open a merge request.
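
To illustrate the fallback described above, here is a rough sketch of how the encoding could be resolved. `resolve_encoding` is a hypothetical helper written for this comment, not the actual code in the fork. Note that sys.getdefaultencoding() is 'utf-8' on Python 3 but 'ascii' on Python 2, so on Python 2 the encoding would probably still need to be passed explicitly for UTF-8 input.

```python
import codecs
import sys


def resolve_encoding(encoding=None):
    """Hypothetical helper: pick the encoding to use for reading and writing.

    Falls back to sys.getdefaultencoding() when no encoding is given and
    normalizes aliases such as "UTF8" to the codec's canonical name.
    """
    if encoding is None:
        encoding = sys.getdefaultencoding()
    # codecs.lookup raises LookupError for unknown encoding names.
    return codecs.lookup(encoding).name


print(resolve_encoding("UTF8"))
print(resolve_encoding())
```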