hashmap() on windows garbles UTF-8 strings!!!
dan-reznik opened this issue · 1 comments
dan-reznik commented
When an accented UTF-8 key and/or value is passed to hashmap() on Windows, the contents get garbled, which suggests an encoding problem. Here's an example from my Windows 10 machine (sessionInfo at the end):
library(tidyverse)
library(hashmap)
> j_l1 <- "joão"
> j_l1 %>% Encoding
[1] "latin1"
# this works
> hashmap(j_l1,j_l1)
## (character) => (character)
## [joão] => [joão]
# however, hashmap does not like UTF-8's on windows.
> j_u8 <- iconv(j_l1,"latin1","UTF-8")
> j_u8 %>% Encoding
[1] "UTF-8"
# the console still displays this correctly!
> j_u8
[1] "joão"
# this is where it breaks!
> hashmap(j_u8,j_u8)
## (character) => (character)
## [joão] => [joão]
# note: somehow all tidyverse functions handle strings beautifully on windows.
# note: the reverse problem happens on linux: hashmap "likes" UTF-8 strings but fails on latin1 strings!
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2.2 stringi_1.2.4 furrr_0.1.0.9001 future_1.9.0
[5] tictoc_1.0 data.table_1.11.4 foreach_1.4.4 jsonlite_1.5
[9] glue_1.3.0 pipeR_0.6.1.3 rlist_0.4.6.1 lubridate_1.7.4
[13] forcats_0.3.0 stringr_1.3.1 dplyr_0.7.6 purrr_0.2.5
[17] readr_1.1.1 tidyr_0.8.1 tibble_1.4.2 ggplot2_3.0.0
[21] tidyverse_1.2.1 hashmap_0.2.2
loaded via a namespace (and not attached):
[1] tidyselect_0.2.4 listenv_0.7.0 haven_1.1.2 lattice_0.20-35 colorspace_1.3-2
[6] yaml_2.2.0 rlang_0.2.2 pillar_1.3.0 withr_2.1.2 modelr_0.1.2
[11] readxl_1.1.0 bindr_0.1.1 plyr_1.8.4 munsell_0.5.0 gtable_0.2.0
[16] cellranger_1.1.0 rvest_0.3.2 codetools_0.2-15 knitr_1.20 parallel_3.5.1
[21] broom_0.5.0 Rcpp_0.12.18 scales_1.0.0 backports_1.1.2 hms_0.4.2
[26] digest_0.6.15 grid_3.5.1 cli_1.0.0 tools_3.5.1 magrittr_1.5
[31] lazyeval_0.2.1 crayon_1.3.4 pkgconfig_2.0.2 xml2_1.2.0 assertthat_0.2.0
[36] httr_1.3.1 rstudioapi_0.7 iterators_1.0.10 globals_0.12.1 R6_2.2.2
[41] nlme_3.1-137 compiler_3.5.1
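For anyone trying to reproduce this: the two strings in the Windows example print identically but differ at the byte level, which can be checked with base R alone. The sketch below is not taken from the hashmap source. It only illustrates one plausible mechanism for the garbling, namely UTF-8 bytes being read as a single-byte encoding. The variable `bytes_as_l1` is introduced here purely for illustration.

```r
# "jo\u00e3o" is "joão" written with a Unicode escape, so the script does not
# depend on the encoding of the source file; R marks the literal as UTF-8.
j_u8 <- "jo\u00e3o"
j_l1 <- iconv(j_u8, from = "UTF-8", to = "latin1")

Encoding(j_u8)   # "UTF-8"
Encoding(j_l1)   # "latin1"

# Same characters, different bytes
charToRaw(j_u8)  # 6a 6f c3 a3 6f  (ã takes two bytes in UTF-8)
charToRaw(j_l1)  # 6a 6f e3 6f     (ã takes one byte in latin1)

# Assumption for illustration: relabel the UTF-8 bytes as latin1, then convert.
# Each of the two bytes c3 a3 is now treated as its own character.
bytes_as_l1 <- j_u8
Encoding(bytes_as_l1) <- "latin1"
garbled <- enc2utf8(bytes_as_l1)
nchar(garbled)   # 5 characters instead of 4
```

If hashmap() drops the encoding mark when it copies strings into its C++ storage, output of this kind would be expected on a CP1252 locale, but that is a guess and not something confirmed in this thread.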
dan-reznik commented
This is what happens on my Linux box: hashmap() won't accept latin1-encoded strings.
> p <- "João"
> p%>%Encoding
[1] "UTF-8"
# this works no problem! (unlike windows)
> hashmap(p,p)
## (character) => (character)
## [João] => [João]
> pl1 <- iconv(p,"UTF-8","latin1")
> pl1
[1] "João"
> pl1%>%Encoding
[1] "latin1"
# the call to hashmap() below fails outright! this is not serious (UTF-8 is the standard)
# the real problem is hashmap() on windows as in the previous comment!
> hashmap(pl1,pl1)
******** Error in nchar(.keys) : invalid multibyte string, element 1
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.1 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] dplyr_0.7.6 hashmap_0.2.2
loaded via a namespace (and not attached):
[1] Rcpp_0.12.18 codetools_0.2-15 crayon_1.3.4 assertthat_0.2.0
[5] R6_2.2.2 magrittr_1.5 pillar_1.3.0 rlang_0.2.2
[9] bindrcpp_0.2.2 tools_3.5.1 glue_1.3.0 purrr_0.2.5
[13] compiler_3.5.1 pkgconfig_2.0.1 bindr_0.1.1 tidyselect_0.2.4
[17] tibble_1.4.2
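In both reports hashmap() behaves well when the string is in the session's native encoding (latin1/CP1252 on the Windows machine, UTF-8 on the Linux box) and misbehaves otherwise. A possible workaround, sketched below, is to re-encode keys and values with base R's `enc2native()` before building the map. The helper `to_native()` is hypothetical and not part of hashmap, and whether this actually avoids the problem in hashmap 0.2.2 is untested here.

```r
# Hypothetical helper (not part of hashmap): convert a character vector to the
# session's native encoding before handing it to hashmap().
to_native <- function(x) {
  stopifnot(is.character(x))
  enc2native(x)
}

p_u8 <- "Jo\u00e3o"                                   # marked UTF-8
p_l1 <- iconv(p_u8, from = "UTF-8", to = "latin1")    # marked latin1

k_u8 <- to_native(p_u8)
k_l1 <- to_native(p_l1)

# In a locale that can represent "ã" (UTF-8 or latin1/CP1252), both inputs
# should now carry the same bytes and so refer to the same key.
same_bytes <- identical(charToRaw(k_u8), charToRaw(k_l1))

# With hashmap installed, the map would then be built from the converted
# vectors (untested assumption):
# h <- hashmap::hashmap(k_u8, k_u8)
```

One caveat: `enc2native()` replaces characters the native encoding cannot represent with `<U+XXXX>` escapes, so on a CP1252 Windows session this only helps for text that fits in that code page.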