yihui/xfun

spin fails with encoding error although the file is encoded as utf-8

tomtom opened this issue · 6 comments

I have a file that I convert to HTML using (a customized version of) spin(). The file is UTF-8 encoded, yet during conversion I get the following warnings and errors:

First:

Warning in read_utf8(input) :
  The file ???.Rmd is not encoded in UTF-8. These lines contain invalid UTF-8 characters: 76, 87, 94, 98, 103, 107, ...
Calls: system.time ... knit2html -> grep -> head -> read_utf8 -> <Anonymous>
Warning in knit(input, text = text, envir = envir, quiet = quiet) :
  The file "???.Rmd" should be encoded in UTF-8. Now I will try to read it with the system native encoding (which may not be correct). We will only support UTF-8 in the near future. Please see https://yihui.name/en/2018/11/biggest-regret-knitr/ for more info.

And later on:

<simpleError in xfun::read_utf8(file, error = TRUE): The file ???.md is not encoded in UTF-8. These lines contain invalid UTF-8 characters: 116, 117, 130, 131, 147, 148, ...>

I verified that both the Rmd and the md file are actually UTF-8 encoded. I'm at a loss as to what to do.
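
For reference, one way to check a file's encoding independently of the session's options(encoding) setting is to read it through a connection that performs no re-encoding and test each line with base R's validUTF8(). This is only a minimal sketch; invalid_utf8_lines() is a hypothetical helper name, not part of xfun or knitr.

```r
# Hypothetical helper (not part of xfun or knitr): return the numbers of the
# lines in `path` whose bytes are not valid UTF-8.
invalid_utf8_lines <- function(path) {
  # encoding = "native.enc" means "do not re-encode", whatever
  # getOption("encoding") is currently set to.
  con <- file(path, encoding = "native.enc")
  on.exit(close(con), add = TRUE)
  lines <- readLines(con, warn = FALSE)
  which(!validUTF8(lines))
}
```

An empty result (integer(0)) means every line of the file is valid UTF-8, byte for byte.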

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 16299)

Matrix products: default

locale:
[1] LC_COLLATE=German_Austria.1252  LC_CTYPE=German_Austria.1252    LC_MONETARY=German_Austria.1252
[4] LC_NUMERIC=C                    LC_TIME=German_Austria.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] extrafont_0.17  reshape2_1.4.3  lubridate_1.7.4 stringr_1.4.0   ggplot2_3.2.0   dplyr_0.8.3     assertive_0.3-5
[8] memoise_1.1.0   RODBC_1.3-15   

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.1                 assertive.data_0.0-3       plyr_1.8.4                 compiler_3.6.1            
 [5] pillar_1.4.2               assertive.files_0.0-2      assertive.properties_0.0-4 tools_3.6.1               
 [9] assertive.data.us_0.0-2    zeallot_0.1.0              digest_0.6.20              assertive.base_0.0-7      
[13] tibble_2.1.3               gtable_0.3.0               pkgconfig_2.0.3            rlang_0.4.0               
[17] cli_1.1.0                  xfun_0.10                  Rttf2pt1_1.3.7             withr_2.1.2               
[21] knitr_1.23                 vctrs_0.2.0                assertive.strings_0.0-3    assertive.sets_0.0-3      
[25] assertive.types_0.0-3      assertive.datetimes_0.0-2  assertive.matrices_0.0-2   grid_3.6.1                
[29] tidyselect_0.2.5           glue_1.3.1                 assertive.code_0.0-3       R6_2.4.0                  
[33] fansi_0.4.0                extrafontdb_1.0            purrr_0.3.2                magrittr_1.5              
[37] backports_1.1.5            assertive.numbers_0.0-2    scales_1.0.0               codetools_0.2-16          
[41] assertive.models_0.0-2     assertthat_0.2.1           colorspace_1.4-1           labeling_0.3              
[45] utf8_1.1.4                 stringi_1.4.3              assertive.data.uk_0.0-2    lazyeval_0.2.2            
[49] munsell_0.5.0              crayon_1.3.4               assertive.reflection_0.0-4

And the encoding option is set to UTF-8:

options(
        ...
        , encoding = "UTF-8"
)

I post it here because this seems to be a problem with xfun::read_utf8.

Regards,
Tom

yihui commented

Please always provide a reproducible example when reporting issues: https://yihui.name/issue/

And I strongly recommend that you do not set options(encoding = "UTF-8"). This option often does more harm than good.

I'll try to come up with an example.

You have to set encoding = "UTF-8" on Windows because otherwise R assumes the standard Windows encoding, which is not UTF-8 and cannot be changed in a corporate environment.

yihui commented

Usually you don't need to set options(encoding). If you have to, there must be something else that is wrong. As I said, setting this option often does more harm than good.

When I don't set encoding, R expects non-utf-8 and causes problems at other places when reading / sourcing utf-8 encoded code. If you want people to use utf8, you have to support it all the way down.

Let's assume the following utf8-encoded file foo.R:

#' # Foo
print("äöü")

This works as expected:

library("knitr")
options(encoding = "native.enc")
spin("foo.R")

This displays a warning:

options(encoding = "UTF-8")
spin("foo.R")
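
A possible explanation for the warning, offered as an assumption rather than something confirmed in this thread: with options(encoding = "UTF-8"), file connections opened by base functions re-encode their input from UTF-8 into the native encoding, so code that afterwards treats the result as UTF-8 sees invalid bytes. The sketch below imitates that step with iconv() in base R. reencode_to_native() is a hypothetical name, and "latin1" stands in for Windows-1252 (the two agree on the bytes for ä, ö and ü).

```r
# Hypothetical helper: imitate what a connection with encoding = "UTF-8"
# would do in a Latin-1 / Windows-1252 session, i.e. convert UTF-8 input
# into the native single-byte encoding.
reencode_to_native <- function(x) {
  iconv(x, from = "UTF-8", to = "latin1")
}

utf8_text <- "\u00e4\u00f6\u00fc"   # "äöü", stored as UTF-8 by the parser
charToRaw(utf8_text)                # c3 a4 c3 b6 c3 bc

native_text <- reencode_to_native(utf8_text)
charToRaw(native_text)              # e4 f6 fc

# The re-encoded bytes no longer form valid UTF-8, which matches the kind
# of complaint in the warning above.
validUTF8(native_text)              # FALSE
```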

I admit handling encodings can be a pain and in R especially so.

Regards

A little off-topic, but as a self-proclaimed "experienced user in R & encoding & Windows", my advice is to avoid setting options(encoding = "UTF-8").

It only saves you a few keystrokes for source() (with RStudio you can simply click a button, or you can create a handy srcutf8() function in ~/.Rprofile, which will then be available whenever you need it).
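
Such a helper could look like the following minimal sketch. The name srcutf8() comes from the comment above; the body is an assumption that relies only on the encoding argument of base R's source().

```r
# Possible ~/.Rprofile entry: source a UTF-8 encoded script without changing
# the global `encoding` option. Further arguments are passed on to source().
srcutf8 <- function(file, ...) {
  source(file, encoding = "UTF-8", ...)
}

# Example call, assuming a UTF-8 encoded foo.R in the working directory:
# srcutf8("foo.R")
```

Because the encoding is passed per call, other base functions keep using the default "native.enc" setting.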

The potential harm is too great because this option is used by many base functions, so changing it may cause many strange problems. Moreover, encoding issues are very tricky, and some of them stem from complicated C-level / R-internal historical issues that are too difficult to solve. You may fix a problem in one place only to have it pop up somewhere else later, costing you a great deal of time and effort. At that point you may ask yourself: why not just avoid setting this option in the first place?

Okay, this seems to work somewhat better. Thank you for this very useful tip.