vubiostat/r-yaml

utf-8 encoding after yaml.load_file

HenricoWitvliet opened this issue · 21 comments

I have a file containing UTF-8 characters. yaml.load_file loads the character strings correctly, but Encoding() reports their encoding as unknown. As a workaround, I currently use Encoding(...) <- 'UTF-8' to set the encoding myself.
It would be nice if the character strings came back with the UTF-8 encoding bit already set.
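
For anyone trying to reproduce this, here is a minimal sketch of the workaround described above. It does not call the yaml package; it only simulates a string whose bytes are UTF-8 but whose encoding mark is missing. The variable names are made up for illustration.

```r
# Non-ASCII text written with a \u escape, so the example does not depend
# on the encoding of this source file.
x <- "caf\u00e9"

# Simulate a reader that returns the right bytes but no encoding mark.
y <- x
Encoding(y) <- "unknown"
print(Encoding(y))

# The workaround: declare the bytes to be UTF-8. This only sets the mark;
# it does not convert or otherwise change the bytes.
Encoding(y) <- "UTF-8"
print(Encoding(y))

# The underlying bytes are identical before and after marking.
print(identical(charToRaw(x), charToRaw(y)))
```

Note that marking is only safe when the bytes really are UTF-8, which should be the case for a valid YAML file without a UTF-16 byte-order mark.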

same problem

This same behavior occurs when using R core functions like readLines, at least in Linux. As far as I know, R does not do any kind of encoding detection. If you run example(Encoding), what is your output?

Since a yaml file is encoded in unicode, I would expect strings to be given this encoding. The character string that yaml.load_file returns in my example is utf-8 encoded. But I haven't tried an example yaml in utf-16, so I don't know if setting a bit in every string would be enough.

Ah, I see. I didn't realize that all YAML documents are unicode, but the YAML specification agrees with you. The specification says that by default, the encoding is UTF-8. For UTF-16, the document must provide a byte-order mark:
http://yaml.org/spec/1.1/#id868742
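
As a rough illustration of the byte-order mark rule, the following sketch inspects the first bytes of a file from R. The function name detect_bom is hypothetical and is not part of the yaml package or LibYAML; it only distinguishes the cases mentioned in this thread (UTF-8 by default, UTF-16 when a BOM is present).

```r
# Hypothetical helper: guess the encoding of a YAML file from its first bytes.
detect_bom <- function(path) {
  first <- readBin(path, what = "raw", n = 3L)
  starts_with <- function(bytes) {
    length(first) >= length(bytes) &&
      identical(first[seq_along(bytes)], as.raw(bytes))
  }
  if (starts_with(c(0xEF, 0xBB, 0xBF))) return("UTF-8")     # UTF-8 BOM
  if (starts_with(c(0xFE, 0xFF)))       return("UTF-16BE")  # big-endian BOM
  if (starts_with(c(0xFF, 0xFE)))       return("UTF-16LE")  # little-endian BOM
  "UTF-8"  # no BOM: fall back to the default encoding
}

# Usage: a plain file without a BOM falls back to UTF-8.
path <- tempfile(fileext = ".yaml")
writeBin(charToRaw("a: 1\n"), path)
print(detect_bom(path))
```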

It looks like LibYAML has an encoding property:
http://pyyaml.org/wiki/LibYAML#StylisticEventAttributes

I'll add this into the next update.

As it turns out, R does not support UTF-16 at all in Encoding() as of version 3.0.2.
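
Since Encoding() cannot mark a string as UTF-16, UTF-16 input would have to be converted rather than marked. A possible sketch using base R's iconv() on raw bytes follows; it assumes the platform's iconv knows the name "UTF-16LE", which is common but not guaranteed, and it is not what the yaml package does.

```r
# The bytes of "café" in UTF-16LE (no BOM), built by hand for the example.
utf16 <- as.raw(c(0x63, 0x00, 0x61, 0x00, 0x66, 0x00, 0xE9, 0x00))

# iconv() accepts a list of raw vectors, so the bytes never have to pass
# through an R string with the wrong encoding.
s <- iconv(list(utf16), from = "UTF-16LE", to = "UTF-8")

print(Encoding(s))
print(charToRaw(s))
```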

yihui commented

We just ran into the same problem. It will be nice if you can explicitly mark the encoding of character strings as UTF-8. Thanks! (We probably do not need to worry about UTF-16)

I had forgotten about this issue, unfortunately. I will take a fresh look at it.

yihui commented

Thanks! FWIW, this is our current workaround: rstudio/rmarkdown#421 (Recursively mark the character elements of yaml.load() output as UTF-8)
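
For readers who cannot follow the link, the idea of that workaround can be sketched roughly as below. This is an illustration written for this thread, not necessarily the exact code in rstudio/rmarkdown#421, and the input list stands in for what yaml.load() might return.

```r
# Recursively mark every character vector (and list names) as UTF-8.
mark_utf8 <- function(x) {
  if (is.character(x)) {
    Encoding(x) <- "UTF-8"
    return(x)
  }
  if (!is.list(x)) return(x)      # numbers, logicals, NULL: leave as-is
  attrs <- attributes(x)
  res <- lapply(x, mark_utf8)     # descend into nested lists
  attributes(res) <- attrs        # keep names, classes, and so on
  names(res) <- mark_utf8(names(res))
  res
}

# Stand-in for parsed YAML: UTF-8 bytes with the encoding mark missing.
s <- "caf\u00e9"
Encoding(s) <- "unknown"
parsed <- list(title = s, nested = list(author = s, year = 2015L))

fixed <- mark_utf8(parsed)
print(Encoding(fixed$title))
print(Encoding(fixed$nested$author))
```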

There seem to be two issues here, one with yaml.load_file and another with yaml.load.

When yaml.load_file calls readLines without explicitly setting the encoding to UTF-8, the contents of a valid UTF-8 encoded yaml file are read into a string whose encoding is marked as unknown (while in fact being UTF-8). On Windows, R treats the string as latin1 (I guess), so the characters are all garbled when displayed. By adding encoding="UTF-8" as a parameter to readLines, the raw text input is read correctly and marked as UTF-8 before being passed on to yaml.load.
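
The difference in readLines behaviour can be seen without the yaml package at all. Here is a small sketch using a temporary file; the file contents are made up for the example.

```r
# Write a one-line YAML document as UTF-8 bytes, independent of the locale.
path <- tempfile(fileext = ".yaml")
writeLines(enc2utf8("title: caf\u00e9"), path, useBytes = TRUE)

plain  <- readLines(path)                      # no encoding argument
marked <- readLines(path, encoding = "UTF-8")  # declare the input as UTF-8

print(Encoding(plain))   # "unknown"
print(Encoding(marked))  # "UTF-8"

# Same bytes in both cases; only the encoding mark differs.
print(identical(charToRaw(plain), charToRaw(marked)))
```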

Although I suggest passing encoding="UTF-8" to readLines in yaml.load_file, that alone does not seem to be enough to fix the problem. Once yaml.load starts processing the text read by readLines, it garbles the characters again by reverting the encoding to unknown.

yihui commented

We were bitten by this issue again: rstudio/bookdown#142. Is there a chance that you could fix it? The fix should be fairly simple (mark the input and output strings as UTF-8), but I'm just not familiar with C.

We encountered the same issue as well, although it can be solved as @yihui did in https://github.com/rstudio/bookdown/blob/3ed7fc6bd30e2832948d28298dee5cd546339fc8/R/utils.R#L82

We thought it would be nicer if it's fixed in the package yaml.

Thanks.

yihui commented

And bitten by this again: rstudio/rmarkdown#841, so yet another patch...

Unfortunately I have precious little time to work on this project at present. A pull request would be appreciated.

yihui commented

@viking Okay, actually that is all I need from you. I'll try to find someone to do the work and submit a pull request. Thanks!

yihui commented

@viking Done in #32. Tested on Windows and *nix.

In the long run, if you feel it is difficult for you to maintain this package, you may consider finding a new maintainer. It seems you are in a situation similar to that of the tikzDevice package, which I was highly interested in but whose original authors lacked time. The yaml package is critical to the R Markdown world, and I hope you will consider increasing the bus factor so this important project can be carried forward nicely in the future.

BTW, I found this article very inspiring: "I gave commit rights to someone I didn't know, I could never have guessed what happened next!"

Thank you.

yihui commented

@viking Any chance you could make a CRAN release soon? I hate bugging you like this, but without the CRAN release, we just keep hearing users report this issue. Here again: http://rmarkdown.rstudio.com/r_notebooks.html#comment-2982649887

I'll get to it soon. Not being funded to do this means that I have other priorities. Please recognize that.

I don't wish to continue this discussion here. I will let you know when the new version is on CRAN.

yihui commented

Yep definitely understood, and much appreciated!

New version is up on CRAN as of about 10 minutes ago.