Encodings in Column Names

Question

clemenskuehn opened this issue a year ago · 0 comments

Summary
When using read.xlsx() on an xlsx file whose column names partly contain non-ASCII UTF-8 characters, the column names in the resulting data.frame end up with mixed encodings.

This can cause errors downstream, e.g. in data.table (see below).

To Reproduce
Create an xlsx file with funny column names, e.g. three columns that contain something like this:

the_good | the_bäd | the_ugly
1 | 4 | 7
2 | 5 | 8
3 | 6 | 9

The following code illustrates the problem (note that the mixed encoding in the column names also exists when not using as.data.table):

library(openxlsx)
library(data.table)  # needed for as.data.table()
library(stringi)     # needed for stri_enc_mark()
testo <- as.data.table(read.xlsx("Test.xlsx"))

testo[, sum(the_good)]
testo[, sum(the_bäd)]

testo[, sum(the_good), by = the_ugly]
testo[, sum(the_bäd), by = the_ugly]

stri_enc_mark(names(testo))

Expected behavior
I would expect the problem to go away when all column names have the same encoding.
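
As a possible workaround until this is clarified, the column names could be inspected and normalized with base R. This is only a sketch under assumptions: the vector col_names is a hypothetical stand-in for names(read.xlsx("Test.xlsx")), and the non-ASCII name is simulated as latin1-marked, which may not be the encoding openxlsx actually produces. I have not confirmed that this resolves the data.table error.

```r
# Hypothetical stand-in for names(read.xlsx("Test.xlsx")).
col_names <- c("the_good", "the_b\u00e4d", "the_ugly")

# Assumption for illustration: the non-ASCII name arrives marked as latin1.
col_names[2] <- iconv(col_names[2], from = "UTF-8", to = "latin1")

# Pure-ASCII strings are never marked, so they report "unknown".
Encoding(col_names)

# enc2utf8() re-encodes marked non-ASCII strings to UTF-8;
# ASCII strings are left untouched.
fixed <- enc2utf8(col_names)
Encoding(fixed)
```

On the imported table this would amount to setnames(testo, enc2utf8(names(testo))), so that every non-ASCII column name is marked as UTF-8.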

Additional context
If you think this is rather a problem of the data.table package, let me know. Although the problem itself is quite exotic, I would expect column names to have the same encoding throughout an imported table.