datawookie/emayili

Encoding issues when using markdown

dominicroye opened this issue · 16 comments

Hi!

I have found an encoding issue when using Spanish characters such as accented letters. For instance, in a Markdown document encoded in UTF-8, the "é" in "Desde el comité organizador" comes out as garbled characters. I traced the origin to readChar() in the read_text() function. If I use readLines() instead, it works fine for me. I am using Windows.

Do you have any idea for a workaround?

Thank you very much! Congratulations on the package!

Best,

Dominic
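The symptom reported above is consistent with UTF-8 bytes being decoded as a single-byte Windows encoding. The following is a minimal sketch of that mechanism using only base R. It is an illustration under that assumption, not a reproduction of what emayili does internally; Latin-1 stands in for Windows-1252 here because the two encodings agree on the bytes involved.

```r
# UTF-8 text containing "é" (written with an escape so the script is encoding-safe).
original <- "Desde el comit\u00e9 organizador"

# The raw UTF-8 bytes: "é" is stored as the two bytes c3 a9.
utf8_bytes <- charToRaw(enc2utf8(original))

# Decode those bytes as if they were single-byte Latin-1 text.
# Each of the two bytes becomes its own character, which garbles the "é".
misread <- iconv(rawToChar(utf8_bytes), from = "latin1", to = "UTF-8")

# Decode the same bytes with the correct encoding instead.
correct <- iconv(rawToChar(utf8_bytes), from = "UTF-8", to = "UTF-8")

print(misread)
print(correct)
```

The misread string is one character longer than the original, because the two-byte "é" is split into two separate characters.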

Hi @dominicroye!

Let's fix this! I don't want Spanish users to have a poor experience. Can you please provide me with some data to work with, ideally an example as a reprex? Also, can you please tell me what locale you are working with?

Thanks, Andrew.

Hi @datawookie.

Thank you for your reply. I made a smaller example on my GitHub: https://github.com/dominicroye/emayili_congress_aec

I just tried it on RStudio Cloud and it works correctly there. It seems to be an issue with Windows, as always. My locale is:

> Sys.getlocale()
[1] "LC_COLLATE=Spanish_Spain.1252;LC_CTYPE=Spanish_Spain.1252;LC_MONETARY=Spanish_Spain.1252;LC_NUMERIC=C;LC_TIME=Spanish_Spain.1252"
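The ".1252" suffix in this locale refers to Windows code page 1252, so the session's native encoding is not UTF-8. A quick way to check this in any session, sketched with base R only:

```r
# Report the character-type locale and what R knows about the native encoding.
ctype <- Sys.getlocale("LC_CTYPE")
info <- l10n_info()

cat("LC_CTYPE:", ctype, "\n")
cat("Native encoding is UTF-8:", info[["UTF-8"]], "\n")
cat("Native encoding is Latin-1:", info[["Latin-1"]], "\n")
```

When the UTF-8 flag is FALSE, text read from a file without an encoding mark is interpreted in the native encoding, which is where UTF-8 files can go wrong.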

Hi @dominicroye, that example link you posted doesn't seem to work anymore. I'm looking at this issue now and it would be helpful to have an example to focus on. Thanks, Andrew.

I tried to recreate the problem based on the information that you have provided and I'm having trouble replicating it (even in my default locale).

library(emayili)
#> 
#> Attaching package: 'emayili'
#> The following object is masked from 'package:graphics':
#> 
#>     text
#> The following objects are masked from 'package:base':
#> 
#>     local, raw
Sys.getlocale("LC_CTYPE")
#> [1] "en_ZA.UTF-8"
envelope() %>% render("Desde el comité organizador") %>% as.character() %>% cat()
#> Date:                        Wed, 08 Dec 2021 06:22:38 GMT
#> X-Mailer:                    {emayili}-0.7.0
#> MIME-Version:                1.0
#> Content-Type:                text/html;
#>                               charset=utf-8
#> Content-Disposition:         inline
#> 
#> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
#> <html><body><p>Desde el comité organizador</p></body></html>

I also tried with a Spanish locale.

library(emayili)
#> 
#> Attaching package: 'emayili'
#> The following object is masked from 'package:graphics':
#> 
#>     text
#> The following objects are masked from 'package:base':
#> 
#>     local, raw
spanish <- Sys.setlocale("LC_ALL", "Spanish")
Sys.getlocale("LC_CTYPE")
#> [1] "Spanish"
envelope() %>% render("Desde el comité organizador") %>% as.character() %>% cat()
#> Date:                        Wed, 08 Dec 2021 06:24:09 GMT
#> X-Mailer:                    {emayili}-0.7.0
#> MIME-Version:                1.0
#> Content-Type:                text/html;
#>                               charset=utf-8
#> Content-Disposition:         inline
#> 
#> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
#> <html><body><p>Desde el comité organizador</p></body></html>

And the es_ES.utf8 locale.

library(emayili)
#> 
#> Attaching package: 'emayili'
#> The following object is masked from 'package:graphics':
#> 
#>     text
#> The following objects are masked from 'package:base':
#> 
#>     local, raw
spanish <- Sys.setlocale("LC_ALL", "es_ES.utf8")
Sys.getlocale("LC_CTYPE")
#> [1] "es_ES.utf8"
envelope() %>% render("Desde el comité organizador") %>% as.character() %>% cat()
#> Date:                        Wed, 08 Dec 2021 06:25:06 GMT
#> X-Mailer:                    {emayili}-0.7.0
#> MIME-Version:                1.0
#> Content-Type:                text/html;
#>                               charset=utf-8
#> Content-Disposition:         inline
#> 
#> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
#> <html><body><p>Desde el comité organizador</p></body></html>

Unfortunately, I could not reproduce your precise locale on my Ubuntu machine, but I don't see a problem for these other Spanish locales (granted I am not an expert in Spanish or locales!).

It is strange: your code works for me too, but the problem appears as soon as I render the external Markdown document instead of an inline string.

envelope() %>% render("Desde el comité organizador") %>% as.character() %>% cat()
>Date:                        Wed, 08 Dec 2021 10:45:55 GMT
>X-Mailer:                    {emayili}-0.6.9
>MIME-Version:                1.0
>Content-Type:                text/html; charset=utf-8
>Content-Disposition:         inline
>
> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
> <html><body><p>Desde el comité organizador</p></body></html>

Here is the output when rendering the Markdown document that you can find on GitHub:

> nombre = "Heinz"
> envelope() %>% render("msg.md") %>% as.character() %>% cat()
Date:                        Wed, 08 Dec 2021 10:48:37 GMT
X-Mailer:                    {emayili}-0.6.9
MIME-Version:                1.0
Content-Type:                text/html; charset=utf-8
Content-Disposition:         inline

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<img src="https://aec2022.uvigo.es/wp-content/uploads/2021/03/logo-Congreso-AEC-v2022.png" width="150" style="float:block"><p>Estimada/o Heinz,</p>
<p>Desde el comité organizador del XII Congreso AEC queremos hacer un recordatorio del calendario del congreso para todos los que, a comienzos de 2020, nos hicisteis llegar un resúmen de vuestro trabajo previsto.</p>
<ul>
<li>
<p><em>Fecha límite para el envío de resúmenes:</em><strong>21/01/2022</strong></p>
</li>
<li>
<p><em>Comunicación de aceptación de resúmenes:</em><strong>7/02/2022</strong></p>
</li>
<li>
<p><em>Fecha límite de envío de trabajos:</em><strong>15/04/2022</strong></p>
</li>
<li>
<p><em>Fecha límite para el envío de los trabajos corregidos:</em><strong>17/06/2022</strong></p>
</li>
</ul>
<p>Aunque presentéis el mismo resúmen, recordad que tendréis que entrar de nuevo en la plataforma diseñada para las inscripciones y el envío de resúmenes y trabajos <a href="https://aec2022.uvigo.es/">https://aec2022.uvigo.es/</a>, y enviarlo antes del <em>21 de enero de 2022</em>.</p>
<p>Para consultar información sobre el congreso podéis entrar en nuestra página web que se halla instalada en la web de la <a href="http://aeclim.org/actividades/congresos-aec/">AEC</a>, o también podéis escribirnos a la secretaría científica: <a href="mailto:congreso.12aec@usc.es">congreso.12aec@usc.es</a></p>
<p>Esperamos que, esta vez sí, podamos encontrarnos todos en Santiago en octubre del próximo año.</p>
<p>Muchas gracias por vuestra comprensión.</p>
<p>Recibe un afectuoso saludo.</p>
<p><strong>El Comité Organizador</strong></p>
</body></html>

I guess the issue is maybe related to reading the markdown document?
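The difference between an inline string and a file can be sketched as follows. This is a minimal base R illustration, assuming the file is read with readChar() as mentioned at the top of the thread; it is not emayili's actual read_text() code. The bytes read from disk carry no record of their encoding, so R treats them as native-encoding text unless they are explicitly marked as UTF-8.

```r
# Write a small UTF-8 file (a stand-in for msg.md) byte for byte,
# so the result does not depend on the session locale.
path <- tempfile(fileext = ".md")
original <- "Desde el comit\u00e9 organizador"
writeBin(charToRaw(enc2utf8(original)), path)

# Read the whole file as one string. The result has no UTF-8 mark yet,
# so a non-UTF-8 locale would interpret its bytes in the native encoding.
text <- readChar(path, nchars = file.size(path), useBytes = TRUE)

# Declare that the bytes are UTF-8. This only sets the mark;
# it does not convert or change the bytes themselves.
Encoding(text) <- "UTF-8"

print(text)
```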

Aha! Okay, that helps. The document which you linked to on GitHub doesn't appear to be there anymore, @dominicroye. Can you check the URL, please? This is what you included above: https://github.com/dominicroye/emayili_congress_aec, but now it gives a 404 error.

Also, now that I understand the problem better this should be a quick fix!

The markdown document is this one https://github.com/dominicroye/emayili_congress_aec/blob/main/msg.md ;-)

Thank you!!

Hi @dominicroye, cool! Thank you. Okay, can you please install this version and test?

remotes::install_github("datawookie/emayili", ref = "spanish-locale")

If that sorts the problem then I'll merge into master.

Thanks, Andrew.

Also I couldn't open that document. Perhaps it's in a private repository?

Hi! Sorry, yes the repository was private. Now you can access the markdown document. Below you can see that it still doesn't encode correctly.

> envelope() %>% render("msg.md") %>% as.character() %>% cat()
Date:                        Thu, 09 Dec 2021 10:33:17 GMT
X-Mailer:                    {emayili}-0.7.0
MIME-Version:                1.0
Content-Type:                text/html;
                              charset=utf-8
Content-Disposition:         inline

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body>
<img src="https://aec2022.uvigo.es/wp-content/uploads/2021/03/logo-Congreso-AEC-v2022.png" width="150" style="float:block"><p>Estimada/o Heinz,</p>
<p>Desde el comité organizador del XII Congreso AEC queremos hacer un recordatorio del calendario del congreso para todos los que, a comienzos de 2020, nos hicisteis llegar un resúmen de vuestro trabajo previsto.</p>
<ul>
<li>
<p><em>Fecha límite para el envío de resúmenes:</em><strong>21/01/2022</strong></p>
</li>
<li>
<p><em>Comunicación de aceptación de resúmenes:</em><strong>7/02/2022</strong></p>
</li>
<li>
<p><em>Fecha límite de envío de trabajos:</em><strong>15/04/2022</strong></p>
</li>
<li>
<p><em>Fecha límite para el envío de los trabajos corregidos:</em><strong>17/06/2022</strong></p>
</li>
</ul>
<p>Aunque presentéis el mismo resúmen, recordad que tendréis que entrar de nuevo en la plataforma diseñada para las inscripciones y el envío de resúmenes y trabajos <a href="https://aec2022.uvigo.es/">https://aec2022.uvigo.es/</a>, y enviarlo antes del <em>21 de enero de 2022</em>.</p>
<p>Para consultar información sobre el congreso podéis entrar en nuestra página web que se halla instalada en la web de la <a href="http://aeclim.org/actividades/congresos-aec/">AEC</a>, o también podéis escribirnos a la secretaría científica: <a href="mailto:congreso.12aec@usc.es">congreso.12aec@usc.es</a></p>
<p>Esperamos que, esta vez sí, podamos encontrarnos todos en Santiago en octubre del próximo año.</p>
<p>Muchas gracias por vuestra comprensión.</p>
<p>Recibe un afectuoso saludo.</p>
<p><strong>El Comité Organizador</strong></p>
</body></html>

Thanks, @dominicroye. Weird because it works 100% fine for me... but I'm on Linux.

I'm puzzled though. You mentioned earlier that this works fine for you when using readLines(). The revised implementation on the branch that I pointed you to is now using readLines() rather than readChar(). Can you please show me the precise code which you ran to successfully load this content without encoding issues?

Are you sure that you installed from the branch I indicated? It's the spanish-locale branch, not master.

remotes::install_github("datawookie/emayili", ref = "spanish-locale")
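One way to confirm which branch an installed package came from is to look at the metadata recorded in its DESCRIPTION file at install time. The sketch below uses a hypothetical helper, installed_ref(), and assumes that the installer stored the ref in a RemoteRef field, which is what remotes usually does for GitHub installs.

```r
# Hypothetical helper: return the Git ref recorded in an installed package's
# DESCRIPTION (field "RemoteRef"), or NA when no such field is present.
installed_ref <- function(pkg) {
  utils::packageDescription(pkg, fields = "RemoteRef")
}

# A base package was not installed from GitHub, so no ref is recorded.
installed_ref("stats")
```

Under the same assumption, calling installed_ref("emayili") after the install command above should report the spanish-locale branch rather than master.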

Thanks, Andrew.

Hi @dominicroye, can you please get back to me about ☝️? I'd like to resolve this issue, and I'm waiting to get it sorted before pushing a new version to CRAN. Could you get back to me soon (by Monday)? Otherwise I'll have to postpone this until a later release. Thanks, Andrew.

Hi @datawookie! I just tried the changes, but the result is the same as before. However, when I read the "msg.md" file directly with readLines() and the argument encoding = "UTF-8", it works fine.

readLines("msg.md", encoding = "UTF-8")

Could it be related to the wrong encoding?
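The role of the encoding argument can be shown with a small self-contained sketch. It uses a temporary file as a stand-in for msg.md and base R only. Note that encoding = "UTF-8" in readLines() does not re-encode the input; it marks the strings that were read as UTF-8 so that they are not interpreted in the native encoding.

```r
# Write one UTF-8 line (plus a trailing newline) byte for byte.
path <- tempfile(fileext = ".md")
original <- "Desde el comit\u00e9 organizador"
writeBin(c(charToRaw(enc2utf8(original)), as.raw(0x0a)), path)

# Read the file and declare its contents to be UTF-8.
lines <- readLines(path, encoding = "UTF-8")

print(Encoding(lines))
print(lines)
```

Without the encoding argument, the same call on a Windows-1252 session would leave the strings unmarked, and the two bytes of "é" would be displayed as two separate characters.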

Many thanks!

Ah, I think that's a pretty important detail you failed to mention!

Hi @dominicroye, I made a quick tweak based on your information ☝️. Can you please install again from the same source and test? Feeling optimistic. 😄 Thanks, Andrew.

Yes! It worked! Thank you very much. :-)