sfirke/janitor

Inconsistent treatment of unicode character in RStudio vs quarto

richardjtelford opened this issue · 4 comments

This problem affects some of the students in my class who use Windows computers.

The students are importing an Excel file that contains the Unicode character \u2103 (℃) in the header row. They then call janitor::clean_names().

For most of the students, janitor::clean_names() converts the column name to "temperature_c" both in RStudio and when rendering with Quarto.
For about 20% of the students, janitor::clean_names() converts the column name to "temperature_u_00b0_c" (00b0 is the Unicode code point for "°") in RStudio but to "temperature_c" when rendered with Quarto. This then breaks the rest of their code when they render the document.

In both RStudio and Quarto the "℃" is imported correctly as UTF-8 and gives the same output with charToRaw() (e2 84 83), so it is not an import problem. Somehow janitor treats the character differently depending on how R is being run.
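The byte check described above can be sketched in base R as follows. This is only an illustration of the diagnostic, not output from the affected machines. As a side note, U+2103 has a Unicode compatibility decomposition to U+00B0 (°) followed by "C", which may be where the "u_00b0" in the unexpected name comes from; that link is an assumption, not something confirmed here.

```r
# Build the character from its escape so the example does not depend on how
# this file itself is encoded.
deg_c <- "\u2103"

# UTF-8 bytes of the character; these should match the e2 84 83 reported above.
bytes <- charToRaw(deg_c)
print(bytes)

# Unicode code point as an integer (0x2103).
code_point <- utf8ToInt(deg_c)
print(as.hexmode(code_point))

# Declared encoding of the string; strings built from \u escapes are marked UTF-8.
print(Encoding(deg_c))
```

If the bytes and code point agree between RStudio and Quarto, the difference is more likely to arise in the later transliteration step than in the import.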

All the affected students are using R 4.2.1 with the current version of RStudio on Windows. Students might have Norwegian locales; I haven't been able to check that.

Minimal example (but it might work correctly for you)

tibble::tibble("Temperature (℃)" = 1) |> janitor::clean_names() |> names()
#temperature_u_00b0_c 

Can you please share the sessionInfo() from one system with the extra text and one without?

Having different language settings can change the way that characters are rendered and converted. We have worked hard to make it consistent, but it seems another case has snuck through.
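For a quick comparison without the full sessionInfo() output, the locale-related settings can be collected with base R alone. This is a minimal sketch; the name `locale_report` is made up for the example.

```r
# Collect the settings most likely to influence character conversion.
# `locale_report` is a hypothetical name used only in this sketch.
info <- l10n_info()

locale_report <- list(
  ctype   = Sys.getlocale("LC_CTYPE"),    # character classification locale
  collate = Sys.getlocale("LC_COLLATE"),  # sorting locale
  is_utf8 = isTRUE(info[["UTF-8"]])       # TRUE when the session locale is UTF-8
)

str(locale_report)
```

Running this once in the RStudio console and once inside a rendered Quarto document on the same machine should show whether the two sessions start with different locales.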

Sometimes other libraries do some degree of transliteration before the names get to clean_names(), too. Which library and function are you using to load the Excel file?

We are importing the data with readxl::read_excel().

Here is the sessionInfo() from one student where the Unicode conversion went wrong. I think the other students with the problem had the same locale.

R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)


Matrix products: default

locale:
[1] C

system code page: 65001

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] janitor_2.1.0   readxl_1.4.1    forcats_0.5.2   stringr_1.4.1   dplyr_1.0.10    purrr_0.3.4     readr_2.1.2     tidyr_1.2.1   
 [9] tibble_3.1.8    ggplot2_3.3.6   tidyverse_1.3.2 here_1.0.1      tidylog_1.0.2

loaded via a namespace (and not attached):
[1] Rcpp_1.0.9          lubridate_1.8.0     lattice_0.20-45     class_7.3-20        clisymbols_1.2.0    digest_0.6.29       assertthat_0.2.1  
 [8] rprojroot_2.0.3     utf8_1.2.2          R6_2.5.1            cellranger_1.1.0    backports_1.4.1     reprex_2.0.2        evaluate_0.16     
[15] e1071_1.7-11        httr_1.4.4          pillar_1.8.1        rlang_1.0.5         googlesheets4_1.0.1 rstudioapi_0.14     Matrix_1.5-1      
[22] rmarkdown_2.16      labeling_0.4.2      splines_4.2.1       googledrive_2.0.0   munsell_0.5.0       proxy_0.4-27        broom_1.0.1       
[29] compiler_4.2.1      modelr_0.1.9        xfun_0.34           pkgconfig_2.0.3     mgcv_1.8-40         htmltools_0.5.3     tidyselect_1.1.2  
[36] fansi_1.0.3         crayon_1.5.1        tzdb_0.3.0          dbplyr_2.2.1        withr_2.5.0         grid_4.2.1          nlme_3.1-157      
[43] jsonlite_1.8.0      gtable_0.3.1        lifecycle_1.0.2     DBI_1.1.3           magrittr_2.0.3      units_0.8-0         scales_1.2.1      
[50] KernSmooth_2.23-20  cli_3.4.0           stringi_1.7.8       farver_2.1.1        fs_1.5.2            snakecase_0.11.0    xml2_1.3.3        
[57] ellipsis_0.3.2      generics_0.1.3      vctrs_0.4.1         tools_4.2.1         glue_1.6.2          hms_1.1.2           yaml_2.3.5        
[64] fastmap_1.1.0       colorspace_2.0-3    gargle_1.2.0        classInt_0.4-8      rvest_1.0.3         knitr_1.40          haven_2.5.1   

Here is the sessionInfo() from a student for whom it worked as expected:

> sessionInfo()
R version 4.2.1 (2022-06-23 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22000)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.utf8  LC_CTYPE=English_United Kingdom.utf8  
[3] LC_MONETARY=English_United Kingdom.utf8 LC_NUMERIC=C                          
[5] LC_TIME=English_United Kingdom.utf8    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base    

other attached packages:
 [1] ggfortify_0.4.14 gt_0.7.0         readxl_1.4.1     broom_1.0.1      forcats_0.5.2    stringr_1.4.1  
 [7] dplyr_1.0.9      purrr_0.3.4      readr_2.1.2      tidyr_1.2.0      tibble_3.1.8     ggplot2_3.3.6  
[13] tidyverse_1.3.2

loaded via a namespace (and not attached):
 [1] lubridate_1.8.0     assertthat_0.2.1    digest_0.6.29       utf8_1.2.2          R6_2.5.1          
 [6] cellranger_1.1.0    backports_1.4.1     reprex_2.0.2        evaluate_0.16       httr_1.4.4        
[11] highr_0.9           pillar_1.8.1        rlang_1.0.4         googlesheets4_1.0.1 rstudioapi_0.14    
[16] car_3.1-0           labeling_0.4.2      googledrive_2.0.0   bit_4.0.4           munsell_0.5.0      
[21] compiler_4.2.1      modelr_0.1.9        janitor_2.1.0       xfun_0.32           pkgconfig_2.0.3    
[26] htmltools_0.5.3     tidyselect_1.1.2    gridExtra_2.3       fansi_1.0.3         crayon_1.5.1      
[31] tzdb_0.3.0          dbplyr_2.2.1        withr_2.5.0         grid_4.2.1          jsonlite_1.8.0    
[36] gtable_0.3.0        lifecycle_1.0.1     DBI_1.1.3           magrittr_2.0.3      scales_1.2.1      
[41] cli_3.4.1           stringi_1.7.8       vroom_1.5.7         carData_3.0-5       farver_2.1.1      
[46] fs_1.5.2            snakecase_0.11.0    xml2_1.3.3          ellipsis_0.3.2      generics_0.1.3    
[51] vctrs_0.4.1         RColorBrewer_1.1-3  tools_4.2.1         bit64_4.0.5         glue_1.6.2        
[56] hms_1.1.2           fastmap_1.1.0       abind_1.4-5         parallel_4.2.1      colorspace_2.0-3  
[61] gargle_1.2.0        rvest_1.0.3         knitr_1.40          haven_2.5.1         sass_0.4.2  

I was looking at this again today, and it's an even thornier problem than I first thought. When I tried it on my system (Windows 11 with a US English locale), I got a plain lowercase "c" (see the reprex below). I think a workaround is to use the replace argument:

tibble::tibble("Temperature (℃)" = 1) |> janitor::clean_names(replace = c("\u2103" = "deg c")) |> names()

degC <- rawToChar(as.raw(c(0xe2, 0x84, 0x83)))
degC
#> [1] "℃"
janitor::make_clean_names(degC)
#> [1] "c"

Created on 2022-12-01 with reprex v2.0.2

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.1 (2022-06-23 ucrt)
#>  os       Windows 10 x64 (build 22621)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_United States.utf8
#>  ctype    English_United States.utf8
#>  tz       America/New_York
#>  date     2022-12-01
#>  pandoc   2.19.2 @ C:/Program Files/RStudio/resources/app/bin/quarto/bin/tools/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.2.1)
#>  cli           3.4.1   2022-09-23 [1] CRAN (R 4.2.1)
#>  DBI           1.1.3   2022-06-18 [1] CRAN (R 4.2.1)
#>  digest        0.6.29  2021-12-01 [1] CRAN (R 4.2.1)
#>  dplyr         1.0.10  2022-09-01 [1] CRAN (R 4.2.1)
#>  evaluate      0.18    2022-11-07 [1] CRAN (R 4.2.2)
#>  fansi         1.0.3   2022-03-24 [1] CRAN (R 4.2.1)
#>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.2.1)
#>  fs            1.5.2   2021-12-08 [1] CRAN (R 4.2.1)
#>  generics      0.1.3   2022-07-05 [1] CRAN (R 4.2.1)
#>  glue          1.6.2   2022-02-24 [1] CRAN (R 4.2.1)
#>  highr         0.9     2021-04-16 [1] CRAN (R 4.2.1)
#>  htmltools     0.5.3   2022-07-18 [1] CRAN (R 4.2.1)
#>  janitor       2.1.0   2021-01-05 [1] CRAN (R 4.2.1)
#>  knitr         1.40    2022-08-24 [1] CRAN (R 4.2.1)
#>  lifecycle     1.0.3   2022-10-07 [1] CRAN (R 4.2.1)
#>  lubridate     1.8.0   2021-10-07 [1] CRAN (R 4.2.1)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.2.1)
#>  pillar        1.8.1   2022-08-19 [1] CRAN (R 4.2.1)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.2.1)
#>  purrr         0.3.5   2022-10-06 [1] CRAN (R 4.2.1)
#>  R.cache       0.16.0  2022-07-21 [1] CRAN (R 4.2.1)
#>  R.methodsS3   1.8.2   2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo          1.25.0  2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils       2.12.0  2022-06-28 [1] CRAN (R 4.2.1)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.2.1)
#>  reprex        2.0.2   2022-08-17 [1] CRAN (R 4.2.1)
#>  rlang         1.0.6   2022-09-24 [1] CRAN (R 4.2.1)
#>  rmarkdown     2.17    2022-10-07 [1] CRAN (R 4.2.1)
#>  rstudioapi    0.14    2022-08-22 [1] CRAN (R 4.2.1)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.2.1)
#>  snakecase     0.11.0  2019-05-25 [1] CRAN (R 4.2.1)
#>  stringi       1.7.8   2022-07-11 [1] CRAN (R 4.2.1)
#>  stringr       1.4.1   2022-08-20 [1] CRAN (R 4.2.1)
#>  styler        1.7.0   2022-03-13 [1] CRAN (R 4.2.1)
#>  tibble        3.1.8   2022-07-22 [1] CRAN (R 4.2.1)
#>  tidyselect    1.2.0   2022-10-10 [1] CRAN (R 4.2.1)
#>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.2.1)
#>  vctrs         0.5.0   2022-10-22 [1] CRAN (R 4.2.2)
#>  withr         2.5.0   2022-03-03 [1] CRAN (R 4.2.1)
#>  xfun          0.34    2022-10-18 [1] CRAN (R 4.2.2)
#>  yaml          2.3.6   2022-10-18 [1] CRAN (R 4.2.1)
#> 
#>  [1] C:/Users/wdenn/AppData/Local/R/win-library/4.2
#>  [2] C:/Program Files/R/R-4.2.1/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────

I'm not sure what the best solution within janitor is for this. A simple fix would be to add this character to the default replace argument. But I think that would take us down the path of maintaining our own Unicode transliteration list, which is not something this package has the capacity to keep up with.

Ideas are welcome.
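One user-side idea, offered as a sketch rather than a proposal for janitor itself: replace known unit symbols with ASCII text before the names reach clean_names(), so the result no longer depends on locale-specific transliteration. The helper `normalise_unit_symbols()` and its short symbol list are hypothetical and use only base R.

```r
# Hypothetical helper (not part of janitor): map a few unit symbols to ASCII.
normalise_unit_symbols <- function(x) {
  # Parallel vectors are used instead of a named vector so that the non-ASCII
  # characters stay in string data rather than in names.
  from <- c("\u2103", "\u2109")  # DEGREE CELSIUS, DEGREE FAHRENHEIT
  to   <- c("C", "F")
  for (i in seq_along(from)) {
    # fixed = TRUE matches the symbol literally, not as a regular expression.
    x <- gsub(from[i], to[i], x, fixed = TRUE)
  }
  x
}

header <- "Temperature (\u2103)"
cleaned_header <- normalise_unit_symbols(header)
print(cleaned_header)
#> [1] "Temperature (C)"
```

In a data frame this could be applied as names(df) <- normalise_unit_symbols(names(df)) before calling janitor::clean_names(df). The obvious limitation is the one raised above: the symbol list has to be maintained by whoever uses it.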