Encoding issue on Windows
ManuelHentschel opened this issue · 2 comments
First off, thanks for this very useful package!
As stated in the title, I'm having a problem with the encoding of special characters on Windows. I read the corresponding part of the README, but did not manage to solve this issue without modifying the Rdpack package itself (or maybe by switching everything to native encoding, but I'd like to avoid that).
My setup is
- Windows 10
- R v4.1.0
- Manual install of Rdpack from the master branch of this repo (though I had the same problems with the CRAN version)
- Local package-project where all files are encoded as UTF-8 (DESCRIPTION, R-files, Rd-files, bibliography) and I have specified this in the DESCRIPTION file.
If I try to cite bibliography entries containing special characters (German umlauts most of the time), they do not show up correctly in the output. A minimal example that reproduces the issue for me is:
\insertRef{DiaLop2020ejor}{Rdpack}
Below is a more detailed example .Rd file illustrating this behavior, and the corresponding HTML produced when installing the package. It seems to me that the output of \Sexpr is always expected to have native encoding (i.e. latin1 on my machine), whereas e.g. \insertRef produces strings that are UTF-8 encoded. Wrapping the corresponding R functions in enc2native seems to fix the issue.
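The suspected mismatch can be reproduced outside of Rd processing. The following is a minimal sketch in base R, assuming the garbled output arises when UTF-8 bytes are interpreted as latin1; it does not use any Rdpack code, and the variable names are purely illustrative.

```r
# "\u00f6" is the umlaut oe; \u escapes always yield a UTF-8 marked string.
x <- "\u00f6"
Encoding(x)      # "UTF-8"
charToRaw(x)     # c3 b6 (one character, two bytes)

# Simulate a consumer that assumes latin1 input: take the same two bytes,
# declare them to be latin1, and convert them to UTF-8.
misread <- iconv(rawToChar(charToRaw(x)), from = "latin1", to = "UTF-8")

# Each original byte has become a character of its own (U+00C3, U+00B6),
# which is the two-character garbage seen in the HTML output.
charToRaw(misread)   # c3 83 c2 b6
nchar(misread)       # 2
```

If this assumption holds, it would also explain why enc2native helps on a latin1 machine: the string is converted to the encoding the consumer expects before it is handed over.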
Content of an .Rd file:
\name{someTest}
\alias{someTest}
\title{Encoding Test}
\section{Trying to write the umlaut oe:}{
\describe{
\item{Normal Rd:}{ö}
\item{\code{\\Sexpr}:}{\Sexpr[results=rd,stage=build]{("ö")}}
\item{Encoding in \code{\\Sexpr:}}{\Sexpr[results=rd,stage=build]{Encoding("ö")}}
\item{\code{\\Sexpr} with \code{enc2utf8}}{\Sexpr[results=rd,stage=build]{enc2utf8("ö")}}
\item{\code{\\Sexpr} with \code{enc2native}}{\Sexpr[results=rd,stage=build]{enc2native("ö")}}
}
}
\section{Trying to cite something:}{
\describe{
\item{\code{\\insertRef}:}{\insertRef{DiaLop2020ejor}{Rdpack}}
\item{\code{\\Sexpr}:}{\Sexpr[results=rd,stage=build]{Rdpack::insert_all_ref(t(c('DiaLop2020ejor', 'Rdpack')))}}
\item{Encoding in \code{\\Sexpr}:}{\Sexpr[results=rd,stage=build]{Encoding(Rdpack::insert_all_ref(t(c('DiaLop2020ejor', 'Rdpack'))))}}
\item{\code{\\Sexpr} with \code{enc2native}:}{\Sexpr[results=rd,stage=build]{enc2native(Rdpack::insert_all_ref(t(c('DiaLop2020ejor', 'Rdpack'))))}}
}
}
Screenshot of the rendered help page:
HTML generated by R CMD INSTALL --html .
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><title>R: Encoding Test</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<link rel="stylesheet" type="text/css" href="R.css" />
</head><body>
<table width="100%" summary="page for someTest {SomePackage}"><tr><td>someTest {SomePackage}</td><td style="text-align: right;">R Documentation</td></tr></table>
<h2>Encoding Test</h2>
<h3>Trying to write the umlaut oe:</h3>
<dl>
<dt>Normal Rd:</dt><dd><p>ö</p>
</dd>
<dt><code>\Sexpr</code>:</dt><dd><p>ö</p>
</dd>
<dt>Encoding in <code>\Sexpr:</code></dt><dd><p>latin1</p>
</dd>
<dt><code>\Sexpr</code> with <code>enc2utf8</code></dt><dd><p>ö</p>
</dd>
<dt><code>\Sexpr</code> with <code>enc2native</code></dt><dd><p>ö</p>
</dd>
</dl>
<h3>Trying to cite something:</h3>
<dl>
<dt><code>\insertRef</code>:</dt><dd><p>Juan
Esteban Diaz, Manuel López-Ibáñez (2021).
“Incorporating Decision-Maker's Preferences into the Automatic Configuration of Bi-Objective Optimisation Algorithms.”
<em>European Journal of Operational Research</em>, <b>289</b>(3), 1209–1222.
doi: <a href="https://doi.org/10.1016/j.ejor.2020.07.059">10.1016/j.ejor.2020.07.059</a>.</p>
</dd>
<dt><code>\Sexpr</code>:</dt><dd><p>Juan
Esteban Diaz, Manuel López-Ibáñez (2021).
“Incorporating Decision-Maker's Preferences into the Automatic Configuration of Bi-Objective Optimisation Algorithms.”
<em>European Journal of Operational Research</em>, <b>289</b>(3), 1209–1222.
doi: <a href="https://doi.org/10.1016/j.ejor.2020.07.059">10.1016/j.ejor.2020.07.059</a>.</p>
</dd>
<dt>Encoding in <code>\Sexpr</code>:</dt><dd><p>UTF-8</p>
</dd>
<dt><code>\Sexpr</code> with <code>enc2native</code>:</dt><dd><p>Juan
Esteban Diaz, Manuel López-Ibáñez (2021).
“Incorporating Decision-Maker's Preferences into the Automatic Configuration of Bi-Objective Optimisation Algorithms.”
<em>European Journal of Operational Research</em>, <b>289</b>(3), 1209–1222.
doi: <a href="https://doi.org/10.1016/j.ejor.2020.07.059">10.1016/j.ejor.2020.07.059</a>.</p>
</dd>
</dl>
<hr /><div style="text-align: center;">[Package <em>SomePackage</em> version 5.5.110 <a href="00Index.html">Index</a>]</div>
</body></html>
Thanks for the report, the detailed investigation, and the examples. This was a long-standing problem on Windows which should have disappeared with R >= 4.2.0. Could you install a more recent version of R (currently 4.2.2) and try again?
Please let me know if you succeed. If you can supply a link to your bib file, I could investigate myself as well.
It is a long story, and Tomas Kalibera from R-core has written a number of posts and blogs about it, but basically Windows was converting UTF-8 to the locale encoding (code page) and then back. In the process, characters not available in that locale were replaced by 'approximations', causing havoc. Windows now has a proper UTF-8 locale, and R 4.2 and later use it. As a consequence, enc2native helps in some cases but is not a universal solution and causes its own problems, since its success depends on the local encoding and on the particular characters involved.
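The dependence on the particular characters can be illustrated with a round trip through a narrower encoding. This is only a sketch in base R: `round_trip` is a hypothetical helper written for this example, latin1 stands in for whatever code page a given machine uses, and what an unrepresentable character gets replaced by may differ between iconv implementations.

```r
# Hypothetical helper: convert UTF-8 text to a narrower encoding and back,
# substituting "?" for bytes that cannot be represented.
round_trip <- function(x, via = "latin1") {
  narrowed <- iconv(x, from = "UTF-8", to = via, sub = "?")
  iconv(narrowed, from = via, to = "UTF-8")
}

umlaut   <- "\u00f6"   # o with diaeresis, representable in latin1
l_stroke <- "\u0142"   # Polish l with stroke, not representable in latin1

identical(round_trip(umlaut), umlaut)       # TRUE: the character survives
identical(round_trip(l_stroke), l_stroke)   # FALSE: the character is lost
```

This mirrors why a fix such as enc2native can look fine for German umlauts on a latin1 system and still fail for a bibliography entry containing characters outside that code page.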
Thanks for the quick response!
Upgrading to 4.2.0 solved the problem, both in the example above and the original project.
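For anyone landing here with the same symptoms, a quick way to see whether a session is affected is to inspect the R version and the locale. The snippet below is a small sketch using only base R; the 4.2.0 threshold is the one mentioned in this thread, and whether R actually runs in UTF-8 on Windows may also depend on the Windows version.

```r
# Report the settings relevant to the behaviour discussed in this thread.
info <- l10n_info()
new_enough <- getRversion() >= "4.2.0"

cat("R version:        ", as.character(getRversion()), "\n")
cat("At least R 4.2.0: ", new_enough, "\n")
cat("UTF-8 session:    ", info[["UTF-8"]], "\n")
cat("Latin-1 session:  ", info[["Latin-1"]], "\n")
```

A session that reports a Latin-1 locale rather than UTF-8 matches the setup in which the garbled references were observed.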