indent: Encoding issue on Windows
Closed this issue · 14 comments
Running the following commands on Linux results in a UTF-8 encoded file:
curl http://200.13.162.79/datosabiertos/HC1/HC1_datos_2020_json.zip > honduras.zip
unzip -o honduras.zip
ocdskit indent HC1_datos_2020.json
But running the equivalent commands on Windows results in an iso-8859-1 encoded file:
curl http://200.13.162.79/datosabiertos/HC1/HC1_datos_2020_json.zip > honduras.zip
tar -x -f honduras.zip
ocdskit indent HC1_datos_2020.json
On Windows, the output of more HC1_datos_2020.json differs before and after running ocdskit indent:
Before indenting, the output includes:
"name": "Secretaria de Salud P\u00fablica"
After indenting, the output includes:
"name": "Secretaria de Salud P�blica"
PYTHONIOENCODING is set to utf-8 and the terminal code page is set to 65001 (utf-8).
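The symptom above is what one would expect if the file were written as ISO-8859-1 and then read as UTF-8. A minimal sketch in Python, assuming that is what happened (nothing at this point in the thread confirms the exact code path):

```python
# Assumption: the indented file was written in ISO-8859-1 and then displayed
# by a tool that expects UTF-8.
name = "Secretaria de Salud Pública"

# In ISO-8859-1, "ú" is the single byte 0xFA.
latin1_bytes = name.encode("iso-8859-1")

# 0xFA cannot start a valid UTF-8 sequence, so a UTF-8 reader substitutes
# U+FFFD (the replacement character) for it.
misread = latin1_bytes.decode("utf-8", errors="replace")

print(misread)
```

The printed name contains the replacement character where "ú" should be, matching the output reported after indenting.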
What precise version of Windows are you running? Apparently some things were fixed in Windows 10 October 2018 Update (build 1809).
How did you set the terminal code page? chcp 65001? Did you set LC_CTYPE=en_US.utf-8? What is the output of python -c "import sys; print(sys.stdout.encoding)"?
What precise version of Windows are you running? Apparently some things were fixed in Windows 10 October 2018 Update (build 1809).
Windows 10 Pro version 1909 OS build 18363.900
How did you set the terminal code page?
chcp 65001
Did you set LC_CTYPE=en_US.utf-8?
No. I tested again after setting it and got the same result.
What is the output of python -c "import sys; print(sys.stdout.encoding)"?
utf-8 (before and after setting LC_CTYPE)
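Note that sys.stdout.encoding only describes the standard output stream. Files opened without an explicit encoding use a separate default, so it can help to print both. A small diagnostic sketch using only the standard library (the report_encodings helper name is hypothetical):

```python
import locale
import sys


def report_encodings():
    """Collect the encodings Python uses for stdout and for open()."""
    return {
        # Encoding of the standard output stream (affected by PYTHONIOENCODING).
        "stdout": sys.stdout.encoding,
        # Default used by open() when no encoding argument is given.
        "open_default": locale.getpreferredencoding(False),
        # 1 when UTF-8 mode is enabled (Python 3.7+), otherwise 0.
        "utf8_mode": sys.flags.utf8_mode,
    }


if __name__ == "__main__":
    for key, value in report_encodings().items():
        print(f"{key}: {value}")
```

If "stdout" reports utf-8 while "open_default" reports something else, the two settings disagree, which would be consistent with the behaviour described in this issue.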
After indenting, the output includes:
How are you reading the output? Can you upload it?
For troubleshooting purposes, I am reading the output using more, but originally I came across the issue because I tried to run cat HC1_datos_2020.json | ocdskit compile > honduras_compiled_releases.json after following the steps in this section of the OCDS Kit Learning Lab to download, extract and indent the file. ocdskit compile reported an encoding error and suggested trying --encoding iso-8859-1.
I've uploaded the file before indenting and after indenting.
Hmm, okay, I'll need to check how Windows determines the output encoding for the indent command. The behavior is a bit surprising.
@duncandewhurst Can you use ocdskit --ascii indent and re-upload the output? This will tell me if the issue is with the output encoding or with the internal representation.
Thanks for continuing to troubleshoot! Let me know if a screen-share would be helpful:
https://drive.google.com/file/d/1C9sTWf-4cv6auXJu5M-10YIl3s4zrb83/view?usp=sharing
I notice the latest document is identical to an earlier one, which is because the indent command ignores the --ascii option 🙃 I'll fix that first.
@duncandewhurst I've now fixed that on HEAD, if you can install from GitHub and run again.
I've uploaded the new version to the same URL: https://drive.google.com/file/d/1C9sTWf-4cv6auXJu5M-10YIl3s4zrb83/view?usp=sharing
Fascinating. The ASCII output correctly escapes the non-ASCII characters (e.g. \u00f3 for ó). I guess that's thanks to Python, which does the encoding.
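For reference, this is how Python's standard json module behaves with and without ASCII-only output. It is only a sketch, assuming the --ascii option corresponds to something like ensure_ascii=True; the thread does not show how ocdskit implements it.

```python
import json

data = {"name": "Secretaria de Salud Pública"}

# ASCII-only output: non-ASCII characters become \uXXXX escapes, so the
# result is byte-for-byte identical in any ASCII-compatible encoding.
escaped = json.dumps(data, ensure_ascii=True, indent=2)

# Non-ASCII characters are kept literally, so the bytes written to disk
# depend on the encoding used when the file is written.
literal = json.dumps(data, ensure_ascii=False, indent=2)

print(escaped)
print(literal)
```

Both strings parse back to the same data. Only the second is sensitive to the output encoding, which is why the ASCII run helps separate an output-encoding problem from a problem in the internal representation.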
Anyway, looks like there's another magic environment variable to make Windows use UTF8 when reading/writing files like the rest of the world: PYTHONUTF8=1
https://dev.to/methane/python-use-utf-8-mode-on-windows-212i
Also, should have asked earlier, what version of Python is this? python --version
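One way to check the effect of PYTHONUTF8=1 without changing the current shell is to start a child interpreter with the variable set. A minimal sketch, assuming Python 3.7 or later (where UTF-8 mode exists); the preferred_encoding helper is hypothetical:

```python
import os
import subprocess
import sys


def preferred_encoding(utf8_mode):
    """Return the default text encoding reported by a child interpreter."""
    env = dict(os.environ)
    # "1" enables UTF-8 mode in the child process, "0" disables it.
    env["PYTHONUTF8"] = "1" if utf8_mode else "0"
    result = subprocess.run(
        [sys.executable, "-c",
         "import locale; print(locale.getpreferredencoding(False))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print("PYTHONUTF8=0:", preferred_encoding(False))
    print("PYTHONUTF8=1:", preferred_encoding(True))
```

With the variable set to 1, the child reports UTF-8 regardless of platform. With 0, it reports whatever the system locale dictates, which on Windows is often a legacy code page.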
Anyway, looks like there's another magic environment variable to make Windows use UTF8 when reading/writing files like the rest of the world: PYTHONUTF8=1
Ah, that did the trick. I've added it to the learning lab instructions for Windows users.
Python version is 3.8.4
I've added that instruction to the docs. Closing.