jgm/pandoc

Pandoc --version reports commitBuffer: invalid argument (cannot encode character '\248') on Windows if home folder includes non-ascii characters

llob opened this issue · 3 comments

llob commented

Using the latest version of Pandoc on Windows 10, executing pandoc --version results in the following output:

pandoc 3.1.13
Features: +server +lua
Scripting engine: Lua 5.4
User data directory: C:\Users\Spandoc: : commitBuffer: invalid argument (cannot encode character '\248')

The problem appears to be that my Windows home folder name contains the Danish character 'ø' (right after the 'S').

This is not a major issue in itself, but the Python library "pandoc" calls pandoc --version on startup to determine which version is installed, so the library becomes effectively unusable under these circumstances.

This occurs with version 3.1.13 of Pandoc on Windows 10.

jgm commented

What is the most recent version where this does not happen?

llob commented

I have tested a few versions, and it seems that the problem first occurred in version 3.0.
Here are the outputs from the first 3.x version and the latest 2.x version:

pandoc.exe 3.0
Features: +server +lua
Scripting engine: Lua 5.4
User data directory: C:\Users\Spandoc.exe: <stdout>: commitBuffer: invalid argument (invalid character)

pandoc.exe 2.19.2
Compiled with pandoc-types 1.22.2.1, texmath 0.12.5.2, skylighting 0.13,
citeproc 0.8.0.1, ipynb 0.2, hslua 2.2.1
Scripting engine: Lua 5.4
User data directory: C:\Users\S├╕renBollOvergaard\AppData\Roaming\pandoc
Copyright (C) 2006-2022 John MacFarlane. Web:  https://pandoc.org
This is free software; see the source for copying conditions. There is no
warranty, not even for merchantability or fitness for a particular purpose.

jgm commented

Thanks for helping to pin that down.

I note that in 2.19.2 the user data directory doesn't appear correctly: the ø has been garbled.
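
The garbled pair "├╕" in the 2.19.2 output is consistent with the two UTF-8 bytes of 'ø' being displayed by a console that uses an OEM code page such as 437. The following is a minimal sketch, using only base, that computes those bytes. The helper name utf8TwoBytes is hypothetical and is not part of pandoc.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Numeric (showHex)

-- Hypothetical helper (not pandoc code): UTF-8 encoding for code points
-- in the two-byte range U+0080..U+07FF only.
utf8TwoBytes :: Char -> [Int]
utf8TwoBytes c =
  [ 0xC0 .|. (n `shiftR` 6)   -- lead byte: 110xxxxx
  , 0x80 .|. (n .&. 0x3F)     -- continuation byte: 10xxxxxx
  ]
  where
    n = ord c

main :: IO ()
main =
  -- '\248' is U+00F8, the character named in the error message.
  putStrLn (unwords [showHex b "" | b <- utf8TwoBytes '\248'])
  -- prints "c3 b8"
```

In code page 437, byte 0xC3 is drawn as ├ and byte 0xB8 as ╕, which matches the garbled directory name above.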

This seems to be an encoding issue. Unfortunately, I don't know much about how encodings work on Windows systems. Do you have a working Haskell setup, by any chance, that would allow you to compile revised code and tell me whether it helps?

I suspect the issue has to do with the putStr at
https://github.com/jgm/pandoc/blob/main/pandoc-cli/src/pandoc.hs#L97-L103
and might go away if we replace this with UTF8.putStr (import qualified Text.Pandoc.UTF8 as UTF8).
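
A minimal sketch of the general idea, using only base and not pandoc's actual code: set the encoding of stdout to UTF-8 explicitly before writing, so the write does not depend on the handle's default encoding. The helper name putStrUtf8 and the example path are hypothetical, and this does not claim to reproduce what Text.Pandoc.UTF8.putStr does internally.

```haskell
import System.IO (hSetEncoding, stdout, utf8)

-- Hypothetical helper (an assumption for illustration, not pandoc's API):
-- force UTF-8 on stdout, then write the string.
putStrUtf8 :: String -> IO ()
putStrUtf8 s = hSetEncoding stdout utf8 >> putStr s

main :: IO ()
main =
  -- Illustrative path only; '\248' is the character 'ø' from the error.
  putStrUtf8 "User data directory: C:\\Users\\S\248ren\\AppData\\Roaming\\pandoc\n"
```

With the handle set to UTF-8, the character can always be encoded, so the write itself should not fail with "cannot encode character". Whether the console then displays 'ø' correctly still depends on the console's code page.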