Use UTF-8 encoding independent of the system locale
Closed this issue · 5 comments
According to the documentation of Data.Text.IO.readFile, an unexpected system locale can cause an "invalid argument (invalid byte sequence)" error.
In fact, on my computer, whose locale is gb2312, typst-hs throws that error if the input file is in gb2312, and gives wrong characters if the input file is UTF-8.
(I can change my locale, but this is...not robust.)
Maybe one needs System.IO.hSetEncoding or something similar.
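A minimal sketch of the hSetEncoding approach suggested above, assuming the goal is simply to read a file as UTF-8 no matter what the locale is. The helper name readFileWithUtf8Handle is hypothetical and not part of typst-hs; the calls it uses (withFile, hSetEncoding, utf8, Data.Text.IO.hGetContents) are standard base and text functions.

```haskell
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import System.IO (IOMode (..), hSetEncoding, utf8, withFile)

-- | Read a whole file as UTF-8 text, ignoring the system locale.
-- NOTE: hypothetical helper for illustration, not the typst-hs API.
readFileWithUtf8Handle :: FilePath -> IO T.Text
readFileWithUtf8Handle path =
  withFile path ReadMode $ \h -> do
    -- Override the locale-derived default encoding on this handle only.
    hSetEncoding h utf8
    -- Data.Text.IO.hGetContents is strict, so the text is fully read
    -- before withFile closes the handle.
    TIO.hGetContents h
```

With this, a gb2312 locale would no longer affect how a UTF-8 input file is decoded, although a file that is actually gb2312-encoded would still fail to decode.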
Is it documented that typst requires UTF-8?
I don't know, sorry. But Data.Text.IO.readFile can't handle non-UTF-8 files correctly either, so it must be a bug.
Here's the documentation for Data.Text.IO.readFile:
Beware that this function (similarly to readFile) is locale-dependent. Unexpected system locale may cause your application to read corrupted data or throw runtime exceptions about "invalid argument (invalid byte sequence)" or "invalid argument (invalid character)". This is also slow, because GHC first converts an entire input to UTF-32, which is afterwards converted to UTF-8.
It's locale-dependent, so I would have thought its expected behavior would be to use your locale's encoding. I would expect it to produce bad results with your locale and a UTF-8 encoded text (that is not a bug). What is unexpected (and maybe a bug in readFile?) is if it cannot properly handle files encoded in your system locale's encoding.
Looking at the code, the typst command line tool seems to presuppose that its input is UTF-8 encoded, so it might make sense for us to do the same rather than using Data.Text.IO.readFile.
From what the docs say, it would be more efficient to read input as a bytestring and then decode UTF-8 than to use readFile + hSetEncoding.
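The bytestring route could look roughly like the sketch below. It assumes the input should always be treated as UTF-8 and that a decoding failure should be reported rather than thrown. The name readFileAsUtf8Bytes is hypothetical; how typst-hs actually wires this in is not shown here.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8')
import Data.Text.Encoding.Error (UnicodeException)

-- | Read raw bytes and decode them as UTF-8, independent of the locale.
-- NOTE: hypothetical helper for illustration, not the typst-hs API.
-- Returns Left with the decoding error if the bytes are not valid UTF-8.
readFileAsUtf8Bytes :: FilePath -> IO (Either UnicodeException T.Text)
readFileAsUtf8Bytes path = decodeUtf8' <$> B.readFile path
```

Because B.readFile never consults the locale, the only remaining failure mode is input that is not valid UTF-8 (for example a gb2312-encoded file), which shows up as a Left value the caller can turn into a readable error message.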