r-dbi/RSQLite

Korean letter encoding error on Windows

jeonghyunwoo opened this issue · 17 comments

Hi, I'm beginner to RSQLite and have a ciritical problem on encoding.
Belows are an example of the problem which I'm suffering.

conn = dbConnect(SQLite(), ':memory')
a = data.frame(가나다 = 1)
dbWriteTable(conn, 'a', a)
dbReadTable(conn, 'a')
Error in make.names(names(out), unique = TRUE) : invalid multibyte string 1

Thanks. This works for me on Linux. What OS are you using?

Dear @krlmlr

I have the same bug in my course of teaching RSQLite and all of my students who use windows 7 or windows 10 have the same bug, too.
Note that previous version of RSQLite on R 3.4 did not have this issue on Windows.

Moreover, if I try to bypass this issue via adding check.names = FALSE in dbReadTable. However, if I try to append the same table via dbWriteTable(..., append = TRUE), another encoding issue related to dbListField occurs.

Since this bug is essential in my class, which is focus on using R to process Open Data in Taiwan, I am willing to send a PR to fix it if you agree.

Best,
Wush

I write a unit test test-encoding.R as a start and the log of appveyor does show that it fails on windows.

Hi maintainers,

Under the latest RSQLite, both the content of the data.frame and column name will be converted to UTF-8 after dbWriteTable under correct system locale (cht on windows and zh_TW.UTF-8 on others).
The bug is due to the dbReadTable and dbListFields cannot correctly process the UTF-8 encoded string.
We need to mark the string via Encoding<- manually.
After letting them correctly process the UTF-8 strings and setting the system locale, the tests are passed.
Here is the details: master...wush978:master
Should I send a PR or will you correct it by yourself?

Note that the BIG5 encoding is the default encoding for tradition Chinese on windows.
So I ask the testthat to skip the test on non-windows platform.
Otherwise, issues similar to tidyverse/tibble#87 (comment) will occur.

Thanks. Looks like this problem is addressed by #276 (now merged). Could you please retry with the current development version?

Just share a tip that you can install the binary version by running install.packages('https://ci.appveyor.com/api/buildjobs/xfqrv3e8ao383dkh/artifacts/RSQLite_2.1.1.9002.zip', type = 'binary', repos = NULL) (assume windows and R version >= 3.5.0)

And the following example works for me.

con <- DBI::dbConnect(RSQLite::SQLite())
tbl <- data.frame("中文" = 'a')
DBI::dbWriteTable(con, name = 'test', value = tbl)
got <- DBI::dbListFields(con, 'test')
got
Encoding(got)
DBI::dbWriteTable(con, name = 'test', value = tbl, append = TRUE)
DBI::dbReadTable(con, 'test')

Thank you.

I'll use the appveyor to test your code based on the works in #233 (comment) .

BTW, although the expression works, I want to mention that using Encoding(.) <- "BIG5" is actually incorrect (that's why it fails on OS other than Windows). This is because :

Character strings in R can be declared to be encoded in "latin1" or "UTF-8" or as "bytes"

A.k.a., only "latin1", "UTF-8", "bytes" and "unknown" (the native encoding) is legal. It means that the code may return garbage strings even on a Windows machine if rawToChar(as.raw(c(0xe4, 0xb8, 0xad, 0xe6, 0x96, 0x87))) cannot be represented in the locale encoding.

The test is failed:

== testthat results  ===========================================================
  OK: 536 SKIPPED: 5 FAILED: 2
  1. Error: write a table whose colnames is BIG5 encoded (@test-encoding.R#17) 
  2. Error: write a table whose colnames is UTF-8 encoded (@test-encoding.R#33) 

Compared with the previous results, the encoding issue related to dbListFields is fixed, but dbReadTable is not.

Sorry, the failures in #233 (comment) might not be relevant. I am checking my script.

I believe it fails due to your raw char cannot be represented on the Appveyor machine.

It's quite complicated and hard to explain the reason clearly. However, the rule of thumb is:

  1. If you want to test UTF-8, please use cat(stringi::stri_escape_unicode('中文字符')) and write the non-ASCII strings in the format like \uxxxx

  2. If you insist to test non-ASCII strings not in UTF8 encoding, please use the latin1 strings (the only way that can be ensured to run under any OS), see the test example in data.table like this.

UPDATE

The code in your test file expect_identical(res, df, stringsAsFactors = FALSE) should be changed to expect_identical(res, df).

Thank you for your works, I manually confirm that the original issue is no longer existed.

And thanks for your comments about the encoding. I learned a lot from your blogs.

@wush978: Would you like to submit your tests as a PR?

@krlmlr I am glad to. The PR will be sent after the errors indicated by @shrektan are fixed.

Can we close this now?

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary.