utf-8 encoding after yaml.load_file
HenricoWitvliet opened this issue · 21 comments
I've got a file with UTF-8 characters. yaml.load_file loads the character strings correctly, but Encoding() reports their encoding as unknown. For now I use Encoding(...) <- 'UTF-8' to set the encoding.
It would be nice if the character strings had the utf-8 encoding bit set.
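A minimal sketch of that workaround in base R, without the yaml package. The string is built from raw bytes to mimic what is described above: valid UTF-8 content with no encoding mark. The variable name is only illustrative.

```r
# UTF-8 bytes for "café"; rawToChar() returns a string with no declared
# encoding, which mimics the state reported for yaml.load_file() output.
x <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xc3, 0xa9)))

before <- Encoding(x)    # "unknown"

# Declare the encoding. This only sets the mark; the bytes are not changed.
Encoding(x) <- "UTF-8"

after <- Encoding(x)     # "UTF-8"
```

Pure ASCII strings are never marked, so Encoding() keeps returning "unknown" for them even after the assignment.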
same problem
This same behavior occurs when using R core functions like readLines, at least in Linux. As far as I know, R does not do any kind of encoding detection. If you run example(Encoding), what is your output?
Since a yaml file is encoded in unicode, I would expect strings to be given this encoding. The character string that yaml.load_file returns in my example is utf-8 encoded. But I haven't tried an example yaml in utf-16, so I don't know if setting a bit in every string would be enough.
Ah, I see. I didn't realize that all YAML documents are unicode, but the YAML specification agrees with you. The specification says that by default, the encoding is UTF-8. For UTF-16, the document must provide a byte-order mark:
http://yaml.org/spec/1.1/#id868742
It looks like LibYAML has an encoding property:
http://pyyaml.org/wiki/LibYAML#StylisticEventAttributes
I'll add this into the next update.
As it turns out, R does not support UTF-16 at all in Encoding() as of version 3.0.2.
We just ran into the same problem. It will be nice if you can explicitly mark the encoding of character strings as UTF-8. Thanks! (We probably do not need to worry about UTF-16)
I had forgotten about this issue, unfortunately. I will take a fresh look at it.
Thanks! FWIW, this is our current workaround: rstudio/rmarkdown#421 (Recursively mark the character elements of yaml.load() output as UTF-8)
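For readers who want the general idea, here is a rough sketch of such a recursive marking step. It is not the code from rstudio/rmarkdown#421; mark_utf8 is a hypothetical helper name, and the input list only stands in for what yaml.load() might return.

```r
# Hypothetical helper (not part of yaml or rmarkdown): walk a nested list
# and mark every character vector in it as UTF-8.
mark_utf8 <- function(x) {
  if (is.character(x)) {
    Encoding(x) <- "UTF-8"   # sets the mark only; bytes are untouched
    return(x)
  }
  if (is.list(x)) {
    attrs <- attributes(x)   # keep names and any other attributes
    x <- lapply(x, mark_utf8)
    attributes(x) <- attrs
  }
  x                          # numbers, logicals, NULL etc. pass through
}

# Stand-in for parsed YAML: UTF-8 bytes with no encoding mark.
s <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xc3, 0xa9)))
parsed <- list(title = s, nested = list(author = s, n = 1L))
marked <- mark_utf8(parsed)
```

This assumes the strings really are UTF-8; marking strings in another encoding this way would mislabel them.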
There seem to be two issues here, one with yaml.load_file and another with yaml.load.
When yaml.load_file calls readLines without explicitly defining the encoding as UTF-8, the contents of a valid UTF-8 encoded yaml file are read into a string with the encoding set to unknown (while in fact being UTF-8). On Windows, R treats the string as latin1 (I guess), so the characters are all garbled when displayed. By adding encoding="UTF-8" as a parameter to readLines, the raw text input is read correctly and marked as UTF-8 before being passed on to yaml.load.
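The readLines part of this can be checked in isolation with base R. The sketch below writes a small UTF-8 file to a temporary path (the file contents and variable names are made up for illustration) and reads it back with and without the encoding argument. It says nothing about what yaml.load does afterwards.

```r
# Write "name: café" as UTF-8 bytes to a temporary file.
path <- tempfile(fileext = ".yaml")
writeBin(c(charToRaw("name: caf"), as.raw(c(0xc3, 0xa9, 0x0a))), path)

# Without the argument, the line comes back with no encoding mark.
plain <- readLines(path)

# With encoding = "UTF-8", the same bytes are marked as UTF-8.
# The argument only marks the strings; it does not re-encode the input.
marked <- readLines(path, encoding = "UTF-8")

enc_plain  <- Encoding(plain)    # "unknown"
enc_marked <- Encoding(marked)   # "UTF-8"

unlink(path)
```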
While I suggest setting the encoding="UTF-8" parameter for readLines in yaml.load_file, it does not seem to be enough to fix the problem. Once yaml.load starts processing the text read by readLines, it messes the characters up again by reverting the encoding to unknown.
We were bitten by this issue again: rstudio/bookdown#142. Is there a chance that you could fix it? The fix should be fairly simple (mark the input and output strings as UTF-8), but I'm just not familiar with C.
We encountered the same issue as well, although it can be solved as @yihui did in https://github.com/rstudio/bookdown/blob/3ed7fc6bd30e2832948d28298dee5cd546339fc8/R/utils.R#L82
We thought it would be nicer if it were fixed in the yaml package.
Thanks.
And bitten by this again: rstudio/rmarkdown#841, so yet another patch...
Unfortunately I have precious little time to work on this project at present. A pull request would be appreciated.
@viking Okay, actually that is all I need from you. I'll try to find someone to do the work and submit a pull request. Thanks!
@viking Done in #32. Tested on Windows and *nix.
In the long run, if you feel it is difficult for you to maintain this package, you may consider finding a new maintainer. It seems you are in a situation similar to that of the tikzDevice package, which I was highly interested in but whose original authors lacked time. The yaml package is critical to the R Markdown world, and I hope you could consider increasing the bus factor so this important project can be carried forward nicely in the future.
BTW, I found this article very inspiring: "I gave commit rights to someone I didn't know, I could never have guessed what happened next!"
Thank you.
@viking Any chance you could make a CRAN release soon? I hate bugging you like this, but without the CRAN release, we just keep hearing users report this issue. Here again: http://rmarkdown.rstudio.com/r_notebooks.html#comment-2982649887
I'll get to it soon. Not being funded to do this means that I have other priorities. Please recognize that.
I don't wish to continue this discussion here. I will let you know when the new version is on CRAN.
Yep definitely understood, and much appreciated!
New version is up on CRAN as of about 10 minutes ago.