Korean letter encoding error on Windows

Question

Korean letter encoding error on Windows

jeonghyunwoo opened this issue 7 years ago · 17 comments

Hi, I'm beginner to RSQLite and have a ciritical problem on encoding.
Belows are an example of the problem which I'm suffering.

conn = dbConnect(SQLite(), ':memory')
a = data.frame(가나다 = 1)
dbWriteTable(conn, 'a', a)
dbReadTable(conn, 'a')
Error in make.names(names(out), unique = TRUE) : invalid multibyte string 1

Answer 1 · 2017-12-09T21:22:59.000Z

Thanks. This works for me on Linux. What OS are you using?

Answer 2 · 2017-12-10T00:01:20.000Z

My pc is on windows7. Thanks : ) 2017. 12. 10. 오전 6:23에 "Kirill Müller" <notifications@github.com>님이 작성:

…

Thanks. This works for me on Linux. What OS are you using? — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#233 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AQDKZSi7GzVZP6JQ7KTr4pv3tt8KfLIXks5s-vozgaJpZM4PQsXw> .

Answer 3 · 2018-12-01T05:13:25.000Z

Dear @krlmlr

I have the same bug in my course of teaching RSQLite and all of my students who use windows 7 or windows 10 have the same bug, too.
Note that previous version of RSQLite on R 3.4 did not have this issue on Windows.

Moreover, if I try to bypass this issue via adding check.names = FALSE in dbReadTable. However, if I try to append the same table via dbWriteTable(..., append = TRUE), another encoding issue related to dbListField occurs.

Since this bug is essential in my class, which is focus on using R to process Open Data in Taiwan, I am willing to send a PR to fix it if you agree.

Best,
Wush

Answer 4 · 2018-12-01T05:51:08.000Z

I write a unit test test-encoding.R as a start and the log of appveyor does show that it fails on windows.

Answer 5 · 2018-12-09T13:20:04.000Z

Hi maintainers,

Under the latest RSQLite, both the content of the data.frame and column name will be converted to UTF-8 after dbWriteTable under correct system locale (cht on windows and zh_TW.UTF-8 on others).
The bug is due to the dbReadTable and dbListFields cannot correctly process the UTF-8 encoded string.
We need to mark the string via Encoding<- manually.
After letting them correctly process the UTF-8 strings and setting the system locale, the tests are passed.
Here is the details: master...wush978:master
Should I send a PR or will you correct it by yourself?

Note that the BIG5 encoding is the default encoding for tradition Chinese on windows.
So I ask the testthat to skip the test on non-windows platform.
Otherwise, issues similar to tidyverse/tibble#87 (comment) will occur.

Answer 6 · 2018-12-30T11:25:41.000Z

Thanks. Looks like this problem is addressed by #276 (now merged). Could you please retry with the current development version?

Answer 7 · 2018-12-30T13:00:12.000Z

Just share a tip that you can install the binary version by running install.packages('https://ci.appveyor.com/api/buildjobs/xfqrv3e8ao383dkh/artifacts/RSQLite_2.1.1.9002.zip', type = 'binary', repos = NULL) (assume windows and R version >= 3.5.0)

And the following example works for me.

con <- DBI::dbConnect(RSQLite::SQLite())
tbl <- data.frame("中文" = 'a')
DBI::dbWriteTable(con, name = 'test', value = tbl)
got <- DBI::dbListFields(con, 'test')
got
Encoding(got)
DBI::dbWriteTable(con, name = 'test', value = tbl, append = TRUE)
DBI::dbReadTable(con, 'test')

Answer 8 · 2018-12-30T13:07:44.000Z

Thank you.

I'll use the appveyor to test your code based on the works in #233 (comment) .

Answer 9 · 2018-12-30T13:22:46.000Z

BTW, although the expression works, I want to mention that using Encoding(.) <- "BIG5" is actually incorrect (that's why it fails on OS other than Windows). This is because :

Character strings in R can be declared to be encoded in "latin1" or "UTF-8" or as "bytes"

A.k.a., only "latin1", "UTF-8", "bytes" and "unknown" (the native encoding) is legal. It means that the code may return garbage strings even on a Windows machine if rawToChar(as.raw(c(0xe4, 0xb8, 0xad, 0xe6, 0x96, 0x87))) cannot be represented in the locale encoding.

Answer 10 · 2018-12-30T13:43:06.000Z

The test is failed:

== testthat results  ===========================================================
  OK: 536 SKIPPED: 5 FAILED: 2
  1. Error: write a table whose colnames is BIG5 encoded (@test-encoding.R#17) 
  2. Error: write a table whose colnames is UTF-8 encoded (@test-encoding.R#33)

Compared with the previous results, the encoding issue related to dbListFields is fixed, but dbReadTable is not.

Answer 11 · 2018-12-30T14:42:38.000Z

Sorry, the failures in #233 (comment) might not be relevant. I am checking my script.

Answer 12 · 2018-12-30T14:47:44.000Z

I believe it fails due to your raw char cannot be represented on the Appveyor machine.

It's quite complicated and hard to explain the reason clearly. However, the rule of thumb is:

If you want to test UTF-8, please use cat(stringi::stri_escape_unicode('中文字符')) and write the non-ASCII strings in the format like \uxxxx
If you insist to test non-ASCII strings not in UTF8 encoding, please use the latin1 strings (the only way that can be ensured to run under any OS), see the test example in data.table like this.

UPDATE

The code in your test file expect_identical(res, df, stringsAsFactors = FALSE) should be changed to expect_identical(res, df).

Answer 13 · 2018-12-30T14:59:47.000Z

Thank you for your works, I manually confirm that the original issue is no longer existed.

And thanks for your comments about the encoding. I learned a lot from your blogs.

Answer 14 · 2018-12-30T17:56:44.000Z

@wush978: Would you like to submit your tests as a PR?

Answer 15 · 2018-12-31T05:54:34.000Z

@krlmlr I am glad to. The PR will be sent after the errors indicated by @shrektan are fixed.

Answer 16 · 2019-07-17T22:36:22.000Z

Can we close this now?

Answer 17 · 2020-12-07T00:32:20.000Z

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary.