yihui/servr

Unicode not working in inf_mr()

eternal-flame-AD opened this issue · 3 comments

---
title: "Xaringan inf_mr"
output: xaringan::moon_reader
---

無限 `r system2("python", c("-c", shQuote('print("月読")')), stdout = TRUE)`

If I render through rmarkdown::render() I get the expected "無限 月読", but with inf_mr() I only get this warning and a blank output:

Warning message:
In grep("<!-- DISABLE-SERVR-WEBSOCKET -->", body, fixed = TRUE) :
  input string 1 is invalid in this locale

It seems like this is coming from here: 69f1279

I tried adjusting the locale settings. Calling Sys.setlocale("LC_ALL", "Ja_JP.UTF-8") fixes the warning above, but then the stdout is no longer decoded correctly and I get: 無限 <8c><8e><93>

Some more locale gymnastics within the document could probably fix that, but I think dynamic_site shouldn't assume the body is in the system locale.
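
As an illustration of the kind of explicit re-encoding meant here, the sketch below converts Shift-JIS bytes to UTF-8 with base R's iconv(). The byte vector is a hypothetical stand-in for what system2(..., stdout = TRUE) would return from Python under code page 932, and it assumes the platform's iconv knows the "CP932" name (iconv() returns NA where it does not).

```r
# Hypothetical stand-in: the bytes Python would print for "月読" under CP932.
sjis_bytes <- as.raw(c(0x8c, 0x8e, 0x93, 0xc7))
out <- rawToChar(sjis_bytes)  # no encoding mark, so R treats it as native

# Re-encode explicitly instead of relying on the session locale.
# Assumption: "CP932" is a name the local iconv implementation supports.
decoded <- iconv(out, from = "CP932", to = "UTF-8")

# The result is marked as UTF-8 when the conversion succeeds.
Encoding(decoded)
```

This only works around the problem inside the document; it does not change how the server decodes the rendered HTML.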

My OS uses English as the display language and Shift-JIS (code page 932) as the system code page.

[ins] r$> sessionInfo()
R version 4.2.3 (2023-03-15 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22621)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.932  LC_CTYPE=English_United States.932    LC_MONETARY=English_United States.932 LC_NUMERIC=C                          LC_TIME=English_United States.932    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] xaringan_0.28.1

loaded via a namespace (and not attached):
[1] compiler_4.2.3  fastmap_1.1.1   cli_3.6.0       htmltools_0.5.4 xfun_0.37       digest_0.6.31   rlang_1.1.0

yihui commented

Sys.setlocale("LC_ALL", "Ja_JP.UTF-8") may not be enough, since it only changes the locale for R, not for your operating system. Have you tried setting the locale to UTF-8 system-wide? (I don't use Windows, but I assume you can do it in the control panel.) Ideally, after you restart your system and R, sessionInfo() should show the UTF-8 locale.

For useBytes = TRUE, I was following an R core member's suggestion: https://blog.r-project.org/2022/10/10/improvements-in-handling-bytes-encoding/index.html

I can certainly revert 69f1279 if necessary. Thanks!

eternal-flame-AD commented

I read the article you mentioned, and I think I know where the discrepancy comes from. The cause is this line:

if (is.raw(body)) body = rawToChar(body)

This assumes body is in the system locale, but the HTML should be UTF-8, as declared in its meta tag. I think we should change it to something like this:

      if (is.raw(body)) {
        body = rawToChar(body)
        Encoding(body) = "UTF-8"
      }

I tested this and it fixes the issue.
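
A minimal sketch of why the encoding mark matters, using only base R. The html string and body_raw are hypothetical stand-ins for a rendered page and the raw response body; this is not servr's actual code.

```r
# Hypothetical page content with non-ASCII text plus the marker servr looks for.
html <- "<p>\u6708\u8aad</p><!-- DISABLE-SERVR-WEBSOCKET -->"
body_raw <- charToRaw(enc2utf8(html))  # UTF-8 bytes, as served over HTTP

body <- rawToChar(body_raw)
enc_before <- Encoding(body)  # "unknown": bytes are read in the native locale

# Declare the bytes as UTF-8 so string functions stop guessing.
Encoding(body) <- "UTF-8"
enc_after <- Encoding(body)   # "UTF-8"

# The marker search now sees a valid string regardless of the session locale.
found <- grepl("<!-- DISABLE-SERVR-WEBSOCKET -->", body, fixed = TRUE)
```

Setting Encoding() only changes the mark and leaves the bytes untouched, so it is only safe when the body really is UTF-8.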

yihui commented

Great! That is also what I guessed (I should have declared the encoding explicitly). I'll commit the fix in a minute. Thanks!