Encoding issues when using markdown
dominicroye opened this issue · 16 comments
Hi!
I have found an encoding issue when using Spanish characters such as accented letters. For instance, "Desde el comité organizador" is rendered as "Desde el comitÃ© organizador" when it comes from a markdown document encoded in UTF-8. I traced the origin to the readChar() call in the read_text() function. If I use readLines() it works fine for me. I am using Windows.
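For anyone reading along, here is a minimal sketch of how this kind of garbling can arise. It assumes the UTF-8 bytes of the file are being interpreted as Windows-1252 (code page 1252), which matches the locale reported further down but is not confirmed as the exact mechanism inside {emayili}. The variable names are only for illustration.

```r
# "comité" with the accented letter written as a Unicode escape.
original <- "comit\u00e9"

# The UTF-8 encoding of "é" is the two bytes C3 A9.
bytes <- charToRaw(enc2utf8(original))

# Assumption: the same bytes are (wrongly) treated as Windows-1252 text.
# Each byte then becomes its own character: C3 -> "Ã" and A9 -> "©".
garbled <- iconv(rawToChar(bytes), from = "CP1252", to = "UTF-8")

print(garbled)
```

The string gains one character because the two-byte sequence for "é" is read as two separate single-byte characters.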
Do you have any idea for a workaround?
Thank you very much! Congratulations on the package!
Best,
Dominic
Hi @dominicroye!
Let's fix this! I don't want Spanish users to have a poor experience. Can you please provide me with some data to work with? Maybe an example as a reprex? Also, can you please tell me what locale you are working with?
Thanks, Andrew.
Hi @datawookie.
Thank you for your reply. I made a smaller example on my GitHub: https://github.com/dominicroye/emayili_congress_aec
It just worked correctly for me on RStudio Cloud. It seems to be an issue with Windows, as always. My locale is:
> Sys.getlocale()
[1] "LC_COLLATE=Spanish_Spain.1252;LC_CTYPE=Spanish_Spain.1252;LC_MONETARY=Spanish_Spain.1252;LC_NUMERIC=C;LC_TIME=Spanish_Spain.1252"
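As a side note, a quick way to check whether the current R session is running in a UTF-8 locale is sketched below. It only uses base R; the variable name info is arbitrary. A locale ending in .1252, like the one above, would be expected to report FALSE for the UTF-8 flag.

```r
# l10n_info() reports properties of the current locale.
info <- l10n_info()

# TRUE when the session's character set is UTF-8, FALSE otherwise.
print(info[["UTF-8"]])

# The character-type part of the locale, for comparison.
print(Sys.getlocale("LC_CTYPE"))
```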
Hi @dominicroye, that example link you posted doesn't seem to work anymore. I'm looking at this issue now and it would be helpful to have an example to focus on. Thanks, Andrew.
I tried to recreate the problem based on the information that you have provided and I'm having trouble replicating it (even in my default locale).
library(emayili)
>
> Attaching package: 'emayili'
> The following object is masked from 'package:graphics':
>
> text
> The following objects are masked from 'package:base':
>
> local, raw
Sys.getlocale("LC_CTYPE")
> [1] "en_ZA.UTF-8"
envelope() %>% render("Desde el comité organizador") %>% as.character() %>% cat()
> Date: Wed, 08 Dec 2021 06:22:38 GMT
> X-Mailer: {emayili}-0.7.0
> MIME-Version: 1.0
> Content-Type: text/html;
> charset=utf-8
> Content-Disposition: inline
>
> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
> <html><body><p>Desde el comité organizador</p></body></html>
I also tried with a Spanish locale.
library(emayili)
>
> Attaching package: 'emayili'
> The following object is masked from 'package:graphics':
>
> text
> The following objects are masked from 'package:base':
>
> local, raw
spanish <- Sys.setlocale("LC_ALL", "Spanish")
Sys.getlocale("LC_CTYPE")
> [1] "Spanish"
envelope() %>% render("Desde el comité organizador") %>% as.character() %>% cat()
> Date: Wed, 08 Dec 2021 06:24:09 GMT
> X-Mailer: {emayili}-0.7.0
> MIME-Version: 1.0
> Content-Type: text/html;
> charset=utf-8
> Content-Disposition: inline
>
> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
> <html><body><p>Desde el comité organizador</p></body></html>
And the es_ES.utf8 locale.
library(emayili)
>
> Attaching package: 'emayili'
> The following object is masked from 'package:graphics':
>
> text
> The following objects are masked from 'package:base':
>
> local, raw
spanish <- Sys.setlocale("LC_ALL", "es_ES.utf8")
Sys.getlocale("LC_CTYPE")
> [1] "es_ES.utf8"
envelope() %>% render("Desde el comité organizador") %>% as.character() %>% cat()
> Date: Wed, 08 Dec 2021 06:25:06 GMT
> X-Mailer: {emayili}-0.7.0
> MIME-Version: 1.0
> Content-Type: text/html;
> charset=utf-8
> Content-Disposition: inline
>
> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
> <html><body><p>Desde el comité organizador</p></body></html>
Unfortunately, I could not reproduce your precise locale on my Ubuntu machine, but I don't see a problem for these other Spanish locales (granted I am not an expert in Spanish or locales!).
It is a strange thing, since your code works for me too, but not when I use the external markdown document.
envelope() %>% render("Desde el comité organizador") %>% as.character() %>% cat()
>Date: Wed, 08 Dec 2021 10:45:55 GMT
>X-Mailer: {emayili}-0.6.9
>MIME-Version: 1.0
>Content-Type: text/html; charset=utf-8
>Content-Disposition: inline
>
> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
> <html><body><p>Desde el comité organizador</p></body></html>
Here is the output when using the markdown document that you can find on GitHub:
> nombre = "Heinz"
> envelope() %>% render("msg.md") %>% as.character() %>% cat()
Date: Wed, 08 Dec 2021 10:48:37 GMT
X-Mailer: {emayili}-0.6.9
MIME-Version: 1.0
Content-Type: text/html; charset=utf-8
Content-Disposition: inline
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<img src="https://aec2022.uvigo.es/wp-content/uploads/2021/03/logo-Congreso-AEC-v2022.png" width="150" style="float:block"><p>Estimada/o Heinz,</p>
<p>Desde el comité organizador del XII Congreso AEC queremos hacer un recordatorio del calendario del congreso para todos los que, a comienzos de 2020, nos hicisteis llegar un resúmen de vuestro trabajo previsto.</p>
<ul>
<li>
<p><em>Fecha lÃmite para el envÃo de resúmenes:</em><strong>21/01/2022</strong></p>
</li>
<li>
<p><em>Comunicación de aceptación de resúmenes:</em><strong>7/02/2022</strong></p>
</li>
<li>
<p><em>Fecha lÃmite de envÃo de trabajos:</em><strong>15/04/2022</strong></p>
</li>
<li>
<p><em>Fecha lÃmite para el envÃo de los trabajos corregidos:</em><strong>17/06/2022</strong></p>
</li>
</ul>
<p>Aunque presentéis el mismo resúmen, recordad que tendréis que entrar de nuevo en la plataforma diseñada para las inscripciones y el envÃo de resúmenes y trabajos <a href="https://aec2022.uvigo.es/">https://aec2022.uvigo.es/</a>, y enviarlo antes del <em>21 de enero de 2022</em>.</p>
<p>Para consultar información sobre el congreso podéis entrar en nuestra página web que se halla instalada en la web de la <a href="http://aeclim.org/actividades/congresos-aec/">AEC</a>, o también podéis escribirnos a la secretarÃa cientÃfica: <a href="mailto:congreso.12aec@usc.es">congreso.12aec@usc.es</a></p>
<p>Esperamos que, esta vez sÃ, podamos encontrarnos todos en Santiago en octubre del próximo año.</p>
<p>Muchas gracias por vuestra comprensión.</p>
<p>Recibe un afectuoso saludo.</p>
<p><strong>El Comité Organizador</strong></p>
</body></html>
I guess the issue is maybe related to reading the markdown document?
Aha! Okay, that helps. The document which you linked to on GitHub doesn't appear to be there anymore, @dominicroye. Can you check the URL, please? This is what you included above: https://github.com/dominicroye/emayili_congress_aec, but now it gives a 404 error.
Also, now that I understand the problem better this should be a quick fix!
The markdown document is this one https://github.com/dominicroye/emayili_congress_aec/blob/main/msg.md ;-)
Thank you!!
Hi @dominicroye, cool! Thank you. Okay, can you please install this version and test?
remotes::install_github("datawookie/emayili", ref = "spanish-locale")
If that sorts the problem then I'll merge into master.
Thanks, Andrew.
Also I couldn't open that document. Perhaps it's in a private repository?
Hi! Sorry, yes the repository was private. Now you can access the markdown document. Below you can see that it still doesn't encode correctly.
> envelope() %>% render("msg.md") %>% as.character() %>% cat()
Date: Thu, 09 Dec 2021 10:33:17 GMT
X-Mailer: {emayili}-0.7.0
MIME-Version: 1.0
Content-Type: text/html;
charset=utf-8
Content-Disposition: inline
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<img src="https://aec2022.uvigo.es/wp-content/uploads/2021/03/logo-Congreso-AEC-v2022.png" width="150" style="float:block"><p>Estimada/o Heinz,</p>
<p>Desde el comité organizador del XII Congreso AEC queremos hacer un recordatorio del calendario del congreso para todos los que, a comienzos de 2020, nos hicisteis llegar un resúmen de vuestro trabajo previsto.</p>
<ul>
<li>
<p><em>Fecha lÃmite para el envÃo de resúmenes:</em><strong>21/01/2022</strong></p>
</li>
<li>
<p><em>Comunicación de aceptación de resúmenes:</em><strong>7/02/2022</strong></p>
</li>
<li>
<p><em>Fecha lÃmite de envÃo de trabajos:</em><strong>15/04/2022</strong></p>
</li>
<li>
<p><em>Fecha lÃmite para el envÃo de los trabajos corregidos:</em><strong>17/06/2022</strong></p>
</li>
</ul>
<p>Aunque presentéis el mismo resúmen, recordad que tendréis que entrar de nuevo en la plataforma diseñada para las inscripciones y el envÃo de resúmenes y trabajos <a href="https://aec2022.uvigo.es/">https://aec2022.uvigo.es/</a>, y enviarlo antes del <em>21 de enero de 2022</em>.</p>
<p>Para consultar información sobre el congreso podéis entrar en nuestra página web que se halla instalada en la web de la <a href="http://aeclim.org/actividades/congresos-aec/">AEC</a>, o también podéis escribirnos a la secretarÃa cientÃfica: <a href="mailto:congreso.12aec@usc.es">congreso.12aec@usc.es</a></p>
<p>Esperamos que, esta vez sÃ, podamos encontrarnos todos en Santiago en octubre del próximo año.</p>
<p>Muchas gracias por vuestra comprensión.</p>
<p>Recibe un afectuoso saludo.</p>
<p><strong>El Comité Organizador</strong></p>
</body></html>
Thanks, @dominicroye. Weird because it works 100% fine for me... but I'm on Linux.
I'm puzzled though. You mentioned earlier that this works fine for you when using readLines(). The revised implementation on the branch that I pointed you to is now using readLines() rather than readChar(). Can you please show me the precise code which you ran to successfully load this content without encoding issues?
Are you sure that you installed from the branch I indicated? It's the spanish-locale branch, not master.
remotes::install_github("datawookie/emayili", ref = "spanish-locale")
Thanks, Andrew.
Hi @dominicroye, can you please get back to me about ☝️? I'd like to resolve this issue, and I'm waiting to get it sorted before pushing a new version to CRAN. Please get back to me soon (by Monday), otherwise I'll have to postpone this until a later release. Thanks, Andrew.
Hi @datawookie! I just tried the changes, but the result is unchanged. However, when I read the "msg.md" file directly with readLines() and the argument encoding = "UTF-8", it worked fine.
readLines("msg.md", encoding = "UTF-8")
Could it be related to the wrong encoding?
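To make the difference concrete, here is a small self-contained sketch of two ways to read a UTF-8 file so that the text is marked as UTF-8 whatever the session locale is. The temporary file and its content are made up for illustration (it stands in for msg.md), and this is not necessarily how {emayili} reads files internally.

```r
# Write a small UTF-8 file to stand in for msg.md (hypothetical content).
path <- tempfile(fileext = ".md")
writeBin(charToRaw("Fecha l\u00edmite para el env\u00edo\n"), path)

# Option 1: readLines() with the input declared as UTF-8.
# The encoding argument marks the strings; it does not re-encode them.
lines <- readLines(path, encoding = "UTF-8")

# Option 2: readChar() on the raw bytes, then mark the result as UTF-8.
txt <- readChar(path, file.size(path), useBytes = TRUE)
Encoding(txt) <- "UTF-8"

print(lines)
print(Encoding(txt))

# Clean up the temporary file.
unlink(path)
```

Without the explicit encoding, non-ASCII bytes are treated as being in the native encoding, which on a Windows-1252 locale would be consistent with the garbled output shown earlier.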
Many thanks!
Ah, I think that's a pretty important detail you failed to mention!
Hi @dominicroye, I made a quick tweak based on your information ☝️. Can you please install again from the same source and test? Feeling optimistic. 😄 Thanks, Andrew.
Yes! It worked! Thank you very much. :-)