jgm/typst-hs

Use UTF-8 encoding independent of the system locale

Closed this issue · 5 comments

According to the documentation of Data.Text.IO.readFile, an unexpected system locale can cause an "invalid argument (invalid byte sequence)" error.
In fact, on my computer, whose locale is gb2312, typst-hs throws that error if the input file is in gb2312, and produces wrong characters if the input file is UTF-8.
(I can change my locale, but this is... not robust.)

Maybe one needs System.IO.hSetEncoding or something similar.
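
For illustration, the hSetEncoding idea could look roughly like the sketch below. The helper name readFileUtf8 is hypothetical and not part of typst-hs; it only assumes the standard System.IO and Data.Text.IO functions.

```haskell
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import System.IO (IOMode (..), hSetEncoding, utf8, withFile)

-- | Hypothetical helper (not part of typst-hs): read a file as UTF-8
-- regardless of the system locale.
readFileUtf8 :: FilePath -> IO T.Text
readFileUtf8 path =
  withFile path ReadMode $ \h -> do
    -- Override the locale-derived default encoding on this handle only.
    hSetEncoding h utf8
    -- Data.Text.IO.hGetContents is strict, so the whole file is read
    -- before withFile closes the handle.
    TIO.hGetContents h
```

Swapping this in wherever the input file is currently read with Data.Text.IO.readFile should make the result independent of the locale, at the cost of rejecting input that is not valid UTF-8.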

jgm commented

Is it documented that typst requires UTF-8?

I don't know, sorry. But Data.Text.IO.readFile can't handle non-UTF-8 files correctly either, so it must be a bug.

jgm commented

Here's the documentation for Data.Text.IO.readFile:

Beware that this function (similarly to readFile) is locale-dependent. Unexpected system locale may cause your application to read corrupted data or throw runtime exceptions about "invalid argument (invalid byte sequence)" or "invalid argument (invalid character)". This is also slow, because GHC first converts an entire input to UTF-32, which is afterwards converted to UTF-8.

It's locale-dependent, so I would have thought its expected behavior would be to use your locale's encoding. I would expect that it would produce bad results with your locale and a UTF-8 encoded text (that is not a bug). What is unexpected (and maybe a bug in readFile?) is if it cannot properly handle files encoded in your system locale's encoding.

jgm commented

Looking at the code, the typst command line tool seems to presuppose that its input is UTF-8 encoded, so it might? make sense for us to do the same rather than using Data.Text.IO.readFile.

jgm commented

From what the docs say, it would be more efficient to read input as a bytestring and then decode UTF-8 than to use readFile + hSetEncoding.
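
A minimal sketch of that approach, assuming the bytestring and text packages; the name readFileUtf8Bytes is hypothetical and this is not necessarily how typst-hs ended up implementing it:

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Text.Encoding.Error (UnicodeException)

-- | Hypothetical helper (not part of typst-hs): read raw bytes and decode
-- them as UTF-8, ignoring the system locale entirely.
-- Returns Left if the file is not valid UTF-8 instead of throwing.
readFileUtf8Bytes :: FilePath -> IO (Either UnicodeException T.Text)
readFileUtf8Bytes path = TE.decodeUtf8' <$> B.readFile path
```

Using decodeUtf8' keeps the failure case explicit, so the caller can decide whether to report an error or fall back to something like decodeUtf8With lenientDecode, which substitutes U+FFFD for invalid bytes.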