open-contracting/ocdskit

indent: Encoding issue on Windows

Closed this issue · 14 comments

Running the following commands on Linux results in a UTF-8 encoded file:

curl http://200.13.162.79/datosabiertos/HC1/HC1_datos_2020_json.zip > honduras.zip
unzip -o honduras.zip
ocdskit indent HC1_datos_2020.json

But running the equivalent commands on Windows results in an iso-8859-1 encoded file:

curl http://200.13.162.79/datosabiertos/HC1/HC1_datos_2020_json.zip > honduras.zip
tar -x -f honduras.zip
ocdskit indent HC1_datos_2020.json

On Windows, the output of more HC1_datos_2020.json differs before and after running ocdskit indent:

Before indenting, the output includes:

"name": "Secretaria de Salud P\u00fablica"

After indenting, the output includes:

"name": "Secretaria de Salud P�blica"

PYTHONIOENCODING is set to utf-8 and the terminal code page is set to 65001 (utf-8).
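
For readers following along, the symptom above can be reproduced in isolation. This is a minimal sketch, assuming the indented file was written in a legacy single-byte encoding (cp1252 is used here as an assumption; it agrees with iso-8859-1 for "ú") and then displayed by a tool that expects UTF-8:

```python
# Assumption: the file was written with a legacy single-byte code page
# (cp1252 here) and then read back by something that expects UTF-8.
original = "Secretaria de Salud Pública"

legacy_bytes = original.encode("cp1252")  # "ú" becomes the single byte 0xFA
utf8_bytes = original.encode("utf-8")     # "ú" becomes the two bytes 0xC3 0xBA

# A UTF-8 reader cannot interpret a lone 0xFA byte, so it substitutes U+FFFD.
as_seen_by_utf8_reader = legacy_bytes.decode("utf-8", errors="replace")

print(legacy_bytes)
print(utf8_bytes)
print(as_seen_by_utf8_reader)
```

The last line printed contains the same replacement character that appears in the output of more after indenting.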

What precise version of Windows are you running? Apparently some things were fixed in Windows 10 October 2018 Update (build 1809).

How did you set the terminal code page? chcp 65001? Did you set LC_CTYPE=en_US.utf-8? What is the output of python -c "import sys; print(sys.stdout.encoding)"?

What precise version of Windows are you running? Apparently some things were fixed in Windows 10 October 2018 Update (build 1809).

Windows 10 Pro version 1909 OS build 18363.900

How did you set the terminal code page?

chcp 65001

Did you set LC_CTYPE=en_US.utf-8?

No. I tested again after setting it and got the same result.

What is the output of python -c "import sys; print(sys.stdout.encoding)"?

utf-8 (before and after setting LC_CTYPE)
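
A utf-8 value for sys.stdout.encoding does not by itself say which encoding Python uses for files. On Windows with Python versions like the one in this thread, open() without an encoding argument typically falls back to the ANSI code page, which chcp and PYTHONIOENCODING do not change. A small diagnostic sketch (the encoding_report helper is hypothetical, not part of ocdskit) prints both values side by side:

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python uses for the terminal and for files."""
    return {
        # Used by print(); influenced by PYTHONIOENCODING and the console.
        "stdout": sys.stdout.encoding,
        # Default for open() when no encoding argument is given.
        "open_default": locale.getpreferredencoding(False),
        # 1 when UTF-8 mode is on (PYTHONUTF8=1 or -X utf8), otherwise 0.
        "utf8_mode": sys.flags.utf8_mode,
    }


for key, value in encoding_report().items():
    print(f"{key}: {value}")
```

If open_default reports something like cp1252 while stdout reports utf-8, files written without an explicit encoding will not be UTF-8.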

After indenting, the output includes:

How are you reading the output? Can you upload it?

For troubleshooting purposes, I am reading the output using more. I originally came across the issue when I tried to run cat HC1_datos_2020.json | ocdskit compile > honduras_compiled_releases.json after following the steps in this section of the OCDS Kit Learning Lab to download, extract and indent the file. ocdskit compile reported an encoding error and suggested trying --encoding iso-8859-1.
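
The kind of encoding error described here can be sketched without ocdskit. Assuming the indented file contains iso-8859-1 bytes, a strict UTF-8 decode fails, while decoding as iso-8859-1 recovers the text; this is only an illustration of the mechanism, not ocdskit's own code:

```python
# Assumption: the indented file holds iso-8859-1 bytes like these.
data = '{"name": "Secretaria de Salud Pública"}'.encode("iso-8859-1")

try:
    data.decode("utf-8")  # strict decoding, as a UTF-8 reader would do
    utf8_ok = True
except UnicodeDecodeError as error:
    utf8_ok = False
    print(f"UTF-8 decoding failed: {error}")

# Decoding with the encoding the bytes were actually written in succeeds.
recovered = data.decode("iso-8859-1")
print(recovered)
```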

I've uploaded the file before indenting and after indenting.

Hmm, okay, I'll need to check how Windows determines the output encoding for the indent command. The behavior is a bit surprising.

@duncandewhurst Can you use ocdskit --ascii indent and re-upload the output? This will tell me if the issue is with the output encoding or with the internal representation.

Thanks for continuing to troubleshoot! Let me know if a screen-share would be helpful:

https://drive.google.com/file/d/1C9sTWf-4cv6auXJu5M-10YIl3s4zrb83/view?usp=sharing

I notice the latest document is identical to an earlier one, which is because the indent command ignores the --ascii option 🙃 I'll fix that first.

@duncandewhurst I've now fixed that on HEAD, if you can install from GitHub and run again.

Fascinating. The ASCII output correctly escapes the non-ASCII character (e.g. \u00f3 for ó). I guess that's thanks to Python, which does the escaping.
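
A short sketch of why ASCII output sidesteps the problem, using the standard json module. Whether ocdskit's --ascii option maps directly onto ensure_ascii is an assumption here:

```python
import json

record = {"name": "Secretaria de Salud Pública"}

# ensure_ascii=True (the json module's default) escapes non-ASCII characters,
# so the result is the same bytes in ASCII, iso-8859-1, cp1252 and UTF-8.
escaped = json.dumps(record, ensure_ascii=True)

# ensure_ascii=False emits the characters literally, so the bytes on disk
# depend on whichever encoding the file is written with.
literal = json.dumps(record, ensure_ascii=False)

print(escaped)
print(literal)
```

Both strings parse back to the same data; only the literal form is sensitive to the output encoding.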

Anyway, looks like there's another magic environment variable to make Windows use UTF8 when reading/writing files like the rest of the world: PYTHONUTF8=1

https://dev.to/methane/python-use-utf-8-mode-on-windows-212i

Also, should have asked earlier, what version of Python is this? python --version

Anyway, looks like there's another magic environment variable to make Windows use UTF8 when reading/writing files like the rest of the world: PYTHONUTF8=1

Ah, that did the trick. I've added it to the learning lab instructions for Windows users.
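
PYTHONUTF8=1 changes Python's default encoding for files. In code one controls, an alternative is to pass the encoding explicitly so the result does not depend on the platform default. A minimal sketch, where write_indented is a hypothetical helper and not ocdskit's implementation:

```python
import json
import os
import tempfile


def write_indented(data, path):
    """Write indented JSON with an explicit encoding, not the platform default."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
        f.write("\n")


path = os.path.join(tempfile.mkdtemp(), "indented.json")
write_indented({"name": "Secretaria de Salud Pública"}, path)

# Read the raw bytes back to confirm "ú" was stored as UTF-8 (0xC3 0xBA).
with open(path, "rb") as f:
    raw = f.read()
print(raw)
```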

Python version is 3.8.4

I've added that instruction to the docs. Closing.