nshepperd/gpt-2

Windows doesn't automatically use UTF-8 encoding

MrKrzYch00 opened this issue · 2 comments

Shouldn't all file operations pass encoding="utf-8" to make the code more portable to other systems like Windows? Or is there some global switch that could be applied at startup so it doesn't crash with a message like "[...]charmap' codec can't encode character[...]"?
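For illustration, here is a minimal sketch of what explicit encodings on file operations could look like. The helper names `write_text` and `read_text` are hypothetical and not taken from this repository. As for a global switch, Python 3.7+ has a UTF-8 mode (set the `PYTHONUTF8=1` environment variable or run with `-X utf8`) that makes `open()` default to UTF-8 on Windows, though relying on it means every user has to remember to enable it.

```python
# Hypothetical helpers: pass the encoding explicitly instead of relying on
# the platform default (often a legacy "charmap" code page on Windows).

def write_text(path, text, encoding="utf-8"):
    """Write text to path using an explicit encoding."""
    with open(path, "w", encoding=encoding) as f:
        f.write(text)


def read_text(path, encoding="utf-8"):
    """Read text from path using an explicit encoding."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()
```

With the encoding spelled out, the same script should behave identically on Linux, macOS, and Windows regardless of the active code page.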

kinoc commented

I would like an optional encoding flag that defaults to "utf-8" but lets you specify others. I have to use "latin-1" in some cases.
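A rough sketch of such a flag using `argparse` might look like the following. The `--dataset` and `--encoding` option names and the `load_dataset` helper are assumptions made for this example, not existing options of the training scripts.

```python
import argparse


def build_parser():
    """Build a parser with a hypothetical --encoding option (default utf-8)."""
    parser = argparse.ArgumentParser(description="Example dataset loader")
    parser.add_argument("--dataset", required=True,
                        help="Path to the input text file")
    parser.add_argument("--encoding", default="utf-8",
                        help="Text encoding of the dataset (default: utf-8)")
    return parser


def load_dataset(path, encoding="utf-8"):
    """Read the whole dataset using the encoding chosen on the command line."""
    with open(path, "r", encoding=encoding) as f:
        return f.read()
```

Usage would then be along the lines of `args = build_parser().parse_args()` followed by `load_dataset(args.dataset, args.encoding)`, with `--encoding latin-1` passed only for the files that need it.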

Yeah, I'm not yet 100% sure myself whether it should always be UTF-8, or whether one should save the dataset in the system-default encoding and open it as such. I'm trying to train it on Polish text to see the results. Unfortunately, it doesn't want to use Polish accented letters; for example, it replaces ł with a plain l in the samples. Maybe I'm missing something, or it still needs more training? (It does use ó, though, which usually exists in 1-byte encodings.)

EDIT: Never mind the above. It seems the sample output is UTF-8, which my CMD console simply doesn't display correctly; it would need to be converted to ANSI using the Polish code page before output. So in my case UTF-8 is the most valid way to read datasets (without a BOM!). The sample files look OK.
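For anyone hitting the same two problems, here is a small sketch under stated assumptions. The function names are hypothetical, and `cp852` is only an example of a Polish console code page; the actual code page on a given machine may differ. The `utf-8-sig` codec is a standard Python codec that reads UTF-8 files with or without a BOM and drops the BOM if present.

```python
def read_utf8_dropping_bom(path):
    """Read a UTF-8 file; a leading BOM, if any, is stripped by utf-8-sig."""
    with open(path, "r", encoding="utf-8-sig") as f:
        return f.read()


def encode_for_console(text, console_encoding="cp852"):
    """Round-trip text through the console's code page.

    Characters the code page cannot represent become '?', so printing the
    result does not raise UnicodeEncodeError. cp852 is an assumed example.
    """
    return text.encode(console_encoding, errors="replace").decode(console_encoding)
```

Reading with `utf-8-sig` avoids a stray `\ufeff` character at the start of the dataset, and the console helper only affects what is displayed, not what is written to the sample files.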