Thomasdezeeuw/ini

Error reporting interprets input data as raw bytes, not UTF-8-encoded text

Closed this issue · 3 comments

While testing your package, I found that it stumbled on a string which began like this:

const s = `
# Настройки "продавливания" информации в вебсервис:
[webservice]
    URL = "http://webservice.domain.local/blah/blah"
...

(Notice Cyrillic characters in a supposed-to-be comment.)

I had incorrectly guessed the comment leader character (it should have been ;), but the resulting error message was a bit "broken":

ini: synthax error on line 2: unexpected "Ð", expected the seperator "="

The problem here is that the message displays the single byte found at the point where a separator was expected, but that byte is only the first half of a Cyrillic character (which occupies two bytes in UTF-8).
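To illustrate where the "Ð" likely comes from, here is a minimal sketch. It assumes the error path converts a single byte to a string; I have not checked that this is what the package actually does. The helper names firstByteAsText and firstRuneAsText are made up for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstByteAsText mimics formatting one raw byte as if it were a whole
// character (assumed behaviour, not taken from the package source).
// s must be non-empty.
func firstByteAsText(s string) string {
	return string(rune(s[0])) // byte 0xD0 is treated as code point U+00D0
}

// firstRuneAsText decodes the complete UTF-8 sequence at the start of s.
func firstRuneAsText(s string) string {
	r, _ := utf8.DecodeRuneInString(s)
	return string(r)
}

func main() {
	const s = "Настройки"
	fmt.Printf("% x\n", s[:2])             // d0 9d
	fmt.Printf("%q\n", firstByteAsText(s)) // "Ð"
	fmt.Printf("%q\n", firstRuneAsText(s)) // "Н"
}
```

The first byte of "Н" is 0xD0, which on its own is the code point for "Ð", hence the odd character in the message.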

Since apparently the package assumes UTF-8 for the encoding of its keys and values, it might be sensible to try processing the data as a series of Unicode runes, not bytes.

I'm not sure whether it is really feasible or whether it might break some other use cases, so I consider this issue less a problem than a thing to ponder.

To be honest, I really didn't anticipate any multi-byte characters. I do have a question, however: are multi-byte characters (I don't know the correct name, and Google didn't help) only used in comments, or are they also common in keys and/or values?

As for the error message, it is clearly wrong and doesn't give any pointer to where the actual problem lies. This needs improvement.

Runes would be the way to go if full UTF-8 support is the goal. However, the problem with runes is that they will slow down the parsing quite a bit. I don't have the source handy, but another parsing package switched from runes to bytes and saw about a 20% speedup.

I will take a look at this and see if the performance hit is worth supporting UTF-8.

Thanks for the issue report and the digging into the problem.

Well, the simple thing is that when you do var b []byte; ...; s := string(b) in Go, your s contains UTF-8 unless you explicitly interpret it in some other way (which is possible, of course).

I mean, using the range operator over any string is defined to traverse the Unicode runes of that string, not its bytes (conversely, indexing a string like s[i] accesses individual bytes).

Due to that, your package is already UTF-8-aware (keys, values work just OK).

As to whether you should support Unicode for keys and values, I'd say:

  1. Definitely;
  2. ...and you already do that.
  3. So maybe for now it's just worth mentioning in the docs that the package expects its input to be encoded in UTF-8, and that it produces UTF-8 when writing.

I fixed the weird characters in the error messages, but only there; the parsing itself is unchanged. If you find any more issues with UTF-8 (or anything else for that matter), please let me know.
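For readers wondering what an "error messages only" fix can look like, here is one possible sketch. It is not the package's actual code: unexpectedAt is a hypothetical helper, and the idea is simply to keep scanning bytes for speed and decode a full rune only when an error has to be reported.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// unexpectedAt is a hypothetical helper: it describes the character that
// starts at byte offset pos, decoding a whole rune instead of one byte.
// pos must not exceed len(input).
func unexpectedAt(input []byte, pos int) string {
	r, size := utf8.DecodeRune(input[pos:])
	if r == utf8.RuneError && size == 0 {
		return "end of input"
	}
	if r == utf8.RuneError && size == 1 {
		// Invalid UTF-8: fall back to showing the raw byte value.
		return fmt.Sprintf("byte 0x%02x", input[pos])
	}
	return fmt.Sprintf("%q", string(r))
}

func main() {
	line := []byte("# Настройки")
	// Offset 2 is the first byte of 'Н', right after "# ".
	fmt.Printf("unexpected %s, expected the separator \"=\"\n", unexpectedAt(line, 2))
	// unexpected "Н", expected the separator "="
}
```

Since the decoding happens only on the error path, the cost of working with runes is not paid during normal parsing.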