IMSMWU/RClickhouse

[windows] non-ASCII Unicode characters are garbled on Windows

Opened this issue · 4 comments

All non-ASCII Unicode characters are garbled in dataframes obtained via SELECT queries. Data is stored in UTF-8 in Clickhouse, but native R encoding on my Windows machine is CP1251.

Is there a way to preserve character encoding in RClickhouse? For now, I can only correct it manually with set_chr_encoding function.

For example, RMariaDB package is working correctly with UTF-8 strings on Windows.

library(tidyverse)
library(DBI)
library(RClickhouse)
library(rlang)

ch_db <- DBI::dbConnect(RClickhouse::clickhouse())

test_df <- ch_db %>%
  dbGetQuery("select * from test") %>%
  as_tibble()

# View result with garbled cyrillic characters
test_df
#> # A tibble: 1 x 2
#>   translit russian     
#> * <chr>    <chr>       
#> 1 Moskva   РњРѕСЃРєРІР°

# Check encoding
test_df %>%
  purrr::map_dfc(Encoding)
#> # A tibble: 1 x 2
#>   translit russian
#>   <chr>    <chr>  
#> 1 unknown  unknown

# Set UTF-8 encoding for correct display
test_df %>%
  purrr::map_dfc(rlang::set_chr_encoding, encoding = "UTF-8")
#> # A tibble: 1 x 2
#>   translit russian
#>   <chr>    <chr>  
#> 1 Moskva   Москва
Session info
devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.4.3 (2017-11-30)
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  Russian_Russia.1251         
#>  tz       Europe/Moscow               
#>  date     2018-02-01
#> Packages -----------------------------------------------------------------
#>  package     * version date       source        
#>  assertthat    0.2.0   2017-04-11 CRAN (R 3.4.1)
#>  backports     1.1.2   2017-12-13 CRAN (R 3.4.3)
#>  base        * 3.4.3   2017-12-06 local         
#>  bindr         0.1     2016-11-13 CRAN (R 3.4.1)
#>  bindrcpp      0.2     2017-06-17 CRAN (R 3.4.1)
#>  broom         0.4.3   2017-11-20 CRAN (R 3.4.3)
#>  cellranger    1.1.0   2016-07-27 CRAN (R 3.4.1)
#>  cli           1.0.0   2017-11-05 CRAN (R 3.4.2)
#>  colorspace    1.3-2   2016-12-14 CRAN (R 3.4.1)
#>  compiler      3.4.3   2017-12-06 local         
#>  crayon        1.3.4   2017-09-16 CRAN (R 3.4.2)
#>  datasets    * 3.4.3   2017-12-06 local         
#>  DBI         * 0.7     2017-06-18 CRAN (R 3.4.1)
#>  dbplyr        1.2.0   2018-01-03 CRAN (R 3.4.3)
#>  devtools      1.13.4  2017-11-09 CRAN (R 3.4.3)
#>  digest        0.6.15  2018-01-28 CRAN (R 3.4.3)
#>  dplyr       * 0.7.4   2017-09-28 CRAN (R 3.4.2)
#>  evaluate      0.10.1  2017-06-24 CRAN (R 3.4.1)
#>  forcats     * 0.2.0   2017-01-23 CRAN (R 3.4.1)
#>  foreign       0.8-69  2017-06-22 CRAN (R 3.4.3)
#>  ggplot2     * 2.2.1   2016-12-30 CRAN (R 3.4.1)
#>  glue          1.2.0   2017-10-29 CRAN (R 3.4.2)
#>  graphics    * 3.4.3   2017-12-06 local         
#>  grDevices   * 3.4.3   2017-12-06 local         
#>  grid          3.4.3   2017-12-06 local         
#>  gtable        0.2.0   2016-02-26 CRAN (R 3.4.1)
#>  haven         1.1.1   2018-01-18 CRAN (R 3.4.3)
#>  hms           0.4.1   2018-01-24 CRAN (R 3.4.3)
#>  htmltools     0.3.6   2017-04-28 CRAN (R 3.4.1)
#>  httr          1.3.1   2017-08-20 CRAN (R 3.4.1)
#>  jsonlite      1.5     2017-06-01 CRAN (R 3.4.1)
#>  knitr         1.19    2018-01-29 CRAN (R 3.4.3)
#>  lattice       0.20-35 2017-03-25 CRAN (R 3.4.3)
#>  lazyeval      0.2.1   2017-10-29 CRAN (R 3.4.2)
#>  lubridate     1.7.1   2017-11-03 CRAN (R 3.4.2)
#>  magrittr      1.5     2014-11-22 CRAN (R 3.4.1)
#>  memoise       1.1.0   2017-04-21 CRAN (R 3.4.2)
#>  methods     * 3.4.3   2017-12-06 local         
#>  mnormt        1.5-5   2016-10-15 CRAN (R 3.4.1)
#>  modelr        0.1.1   2017-07-24 CRAN (R 3.4.1)
#>  munsell       0.4.3   2016-02-13 CRAN (R 3.4.1)
#>  nlme          3.1-131 2017-02-06 CRAN (R 3.4.3)
#>  parallel      3.4.3   2017-12-06 local         
#>  pillar        1.1.0   2018-01-14 CRAN (R 3.4.3)
#>  pkgconfig     2.0.1   2017-03-21 CRAN (R 3.4.1)
#>  plyr          1.8.4   2016-06-08 CRAN (R 3.4.1)
#>  psych         1.7.8   2017-09-09 CRAN (R 3.4.1)
#>  purrr       * 0.2.4   2017-10-18 CRAN (R 3.4.2)
#>  R6            2.2.2   2017-06-17 CRAN (R 3.4.1)
#>  RClickhouse * 0.4.0   2018-01-20 CRAN (R 3.4.3)
#>  Rcpp          0.12.15 2018-01-20 CRAN (R 3.4.3)
#>  readr       * 1.1.1   2017-05-16 CRAN (R 3.4.1)
#>  readxl        1.0.0   2017-04-18 CRAN (R 3.4.1)
#>  reshape2      1.4.3   2017-12-11 CRAN (R 3.4.3)
#>  rlang       * 0.1.6   2017-12-21 CRAN (R 3.4.3)
#>  rmarkdown     1.8     2017-11-17 CRAN (R 3.4.3)
#>  rprojroot     1.3-2   2018-01-03 CRAN (R 3.4.3)
#>  rstudioapi    0.7     2017-09-07 CRAN (R 3.4.2)
#>  rvest         0.3.2   2016-06-17 CRAN (R 3.4.1)
#>  scales        0.5.0   2017-08-24 CRAN (R 3.4.1)
#>  stats       * 3.4.3   2017-12-06 local         
#>  stringi       1.1.6   2017-11-17 CRAN (R 3.4.2)
#>  stringr     * 1.2.0   2017-02-18 CRAN (R 3.4.1)
#>  tibble      * 1.4.2   2018-01-22 CRAN (R 3.4.3)
#>  tidyr       * 0.8.0   2018-01-29 CRAN (R 3.4.3)
#>  tidyverse   * 1.2.1   2017-11-14 CRAN (R 3.4.3)
#>  tools         3.4.3   2017-12-06 local         
#>  utf8          1.1.3   2018-01-03 CRAN (R 3.4.3)
#>  utils       * 3.4.3   2017-12-06 local         
#>  withr         2.1.1   2017-12-19 CRAN (R 3.4.3)
#>  xml2          1.2.0   2018-01-24 CRAN (R 3.4.3)
#>  yaml          2.1.16  2017-12-12 CRAN (R 3.4.3)

First, thanks for pointing at this issue. Encoding is an important topic and we have to put some effort on it. Feel free to send some push requests, otherwise I'll try to look into it within the next weeks.

Thanks for fast reply. My knowledge of C language is very little, so I can't help you much with pull request. Personally, I think that using
rlang::set_chr_encoding or base::Encoding on the R side will not be very efficient and encoding should be somehow set on C level.

Here is some info about UTF-8 strings, I hope that it will be useful:

v5r commented

Thanks for the very informative bug report. I've looked into the issue, and I think I know what's going on: we're converting the input to UTF-8 for storage, but not back to the current encoding when reading from the database.

I can see two obvious solutions to this, each of which has some disadvantages:

  • Keep encoding data to UTF-8 before storing it in the database, and always re-encode the query result to the current encoding. This should solve your issue, but writes different data (specifically, a different encoding) to the database than the user may have intended. Also, the UTF-8 query result might contain characters which cannot be represented in the current encoding (say, Chinese characters in CP1251).
  • Always store data in the unmodified input encoding. The advantage would be that it would store precisely the data being input. But obviously there is no "normalization" of the encoding, so if you were to write data in CP1251 encoding an then read in UTF-8, that wouldn't work correctly either.

Another question is how to handle DB identifiers and such (Clickhouse accepts arbitrary byte strings for string fields; but column names, enum levels etc. must apparently be valid UTF-8, though I can't seem to find any mention of that in the docs). Here, the first approach is likely the only feasible solution, though that again leads to the issue of query results containing characters which cannot be represented in the current encoding.

Generally, I have to say that, seeing what a mess these encoding issues are, I strongly recommend always using UTF-8 everywhere. I'm not saying we shouldn't try to fix the bug though, and I understand that this isn't always a feasible solution on Windows in particular.

I'd personally prefer the second solution, I think, but as I said, that won't fix all issues either.

If you want to choose the most flexible solution, you can add encoding argument to the dbConnect method with the default value UTF-8. So, with the default value all strings will be encoded to UTF-8 before storing it in the database (if they are not already in UTF-8). All query results must also be in UTF-8, there is no need to convert them to native encoding.

UTF-8 is recommended by Clickhouse documentation:

ClickHouse doesn't have the concept of encodings. Strings can contain an arbitrary set of bytes, which are stored and output as-is. If you need to store texts, we recommend using UTF-8 encoding. At the very least, if your terminal uses UTF-8 (as recommended), you can read and write your values without making conversions.