nathan-russell/hashmap

hashmap() on Windows garbles UTF-8 strings

dan-reznik opened this issue · 1 comment

When an accented UTF-8 key and/or value is passed to hashmap() on Windows, the contents come out garbled, which points to an encoding problem. Here is an example from my Windows 10 machine (sessionInfo at the end):

library(tidyverse)
library(hashmap)

> j_l1 <- "joão"
> j_l1 %>% Encoding
[1] "latin1"
# this works
> hashmap(j_l1,j_l1)
## (character) => (character)
##      [joão] => [joão]     

# however, hashmap does not handle UTF-8 strings correctly on Windows.
> j_u8 <- iconv(j_l1,"latin1","UTF-8")
> j_u8 %>% Encoding
[1] "UTF-8"
# the console still displays this correctly!
> j_u8
[1] "joão"
# this is where it breaks!
> hashmap(j_u8,j_u8)
## (character) => (character)
##     [joão] => [joão]  

# note: all tidyverse functions handle these strings correctly on Windows.
# note: the reverse problem happens on Linux: hashmap accepts UTF-8 strings but fails on latin1 strings!

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] bindrcpp_0.2.2    stringi_1.2.4     furrr_0.1.0.9001  future_1.9.0     
 [5] tictoc_1.0        data.table_1.11.4 foreach_1.4.4     jsonlite_1.5     
 [9] glue_1.3.0        pipeR_0.6.1.3     rlist_0.4.6.1     lubridate_1.7.4  
[13] forcats_0.3.0     stringr_1.3.1     dplyr_0.7.6       purrr_0.2.5      
[17] readr_1.1.1       tidyr_0.8.1       tibble_1.4.2      ggplot2_3.0.0    
[21] tidyverse_1.2.1   hashmap_0.2.2    

loaded via a namespace (and not attached):
 [1] tidyselect_0.2.4 listenv_0.7.0    haven_1.1.2      lattice_0.20-35  colorspace_1.3-2
 [6] yaml_2.2.0       rlang_0.2.2      pillar_1.3.0     withr_2.1.2      modelr_0.1.2    
[11] readxl_1.1.0     bindr_0.1.1      plyr_1.8.4       munsell_0.5.0    gtable_0.2.0    
[16] cellranger_1.1.0 rvest_0.3.2      codetools_0.2-15 knitr_1.20       parallel_3.5.1  
[21] broom_0.5.0      Rcpp_0.12.18     scales_1.0.0     backports_1.1.2  hms_0.4.2       
[26] digest_0.6.15    grid_3.5.1       cli_1.0.0        tools_3.5.1      magrittr_1.5    
[31] lazyeval_0.2.1   crayon_1.3.4     pkgconfig_2.0.2  xml2_1.2.0       assertthat_0.2.0
[36] httr_1.3.1       rstudioapi_0.7   iterators_1.0.10 globals_0.12.1   R6_2.2.2        
[41] nlme_3.1-137     compiler_3.5.1 

This is what happens on my Linux box: hashmap won't accept "latin1" strings.

> p <- "João"
> p%>%Encoding
[1] "UTF-8"
# this works no problem! (unlike windows)
> hashmap(p,p)
## (character) => (character)
##      [João] => [João]

> pl1 <- iconv(p,"UTF-8","latin1")
> pl1
[1] "João"
> pl1%>%Encoding
[1] "latin1"
# the hashmap() call below fails outright! this is not serious (UTF-8 is the standard)
# the real problem is hashmap() on Windows, as in the previous comment!
> hashmap(pl1,pl1)
******** Error in nchar(.keys) : invalid multibyte string, element 1
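
A possible workaround, not verified against hashmap itself: in both sessions above, hashmap() only behaves when the strings are in the session's native encoding (latin1/CP1252 on the Windows box, UTF-8 on the Linux box). Assuming that pattern holds, re-encoding with base R's enc2native() before the call might sidestep the problem. The helper name to_native is hypothetical and not part of hashmap.

```r
# Hypothetical helper (NOT part of hashmap): convert a character vector to the
# session's native encoding, honouring any declared encoding mark.
to_native <- function(x) {
  stopifnot(is.character(x))
  enc2native(x)
}

# Build the test string with a Unicode escape so the source file's own
# encoding does not matter.
j_u8  <- "jo\u00e3o"
j_nat <- to_native(j_u8)

# Inspect the declared encodings before and after conversion.
print(Encoding(j_u8))
print(Encoding(j_nat))

# Assumption: hashmap() handles native-encoded strings, as observed above.
# This part only runs if the package is installed.
if (requireNamespace("hashmap", quietly = TRUE)) {
  print(hashmap::hashmap(j_nat, j_nat))
}
```

If this does help, lookup keys would need the same conversion, otherwise a UTF-8 key would presumably not match its native-encoded copy stored in the map.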

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.1 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] dplyr_0.7.6   hashmap_0.2.2

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.18     codetools_0.2-15 crayon_1.3.4     assertthat_0.2.0
 [5] R6_2.2.2         magrittr_1.5     pillar_1.3.0     rlang_0.2.2
 [9] bindrcpp_0.2.2   tools_3.5.1      glue_1.3.0       purrr_0.2.5
[13] compiler_3.5.1   pkgconfig_2.0.1  bindr_0.1.1      tidyselect_0.2.4
[17] tibble_1.4.2