How to keep HTML entities (such as &copy;) from being converted to special characters?

Question

Opened this issue a year ago · 3 comments

When I process HTML text containing HTML entities, such as &copy;, PreMailer.Net converts them to the corresponding characters (such as ©). The problem with this particular case is that every HTML email template we use has a footer with the copyright statement. AFAIK, &copy; is legitimate HTML and should be used as-is. When PreMailer converts &copy; to ©, it confuses Gmail: if you open a message containing the © character in Gmail, you will see a notice that the message was trimmed, even though it wasn't. If I change © back to &copy; and send exactly the same message, Gmail displays it correctly. So why does PreMailer think that it must convert HTML entities? And is there any way to prevent this behavior? Thanks.

Answer 1 · 2023-09-15T03:32:42.000Z

Hi @alekdavisintel

This seems to be a limitation of AngleSharp; see AngleSharp/AngleSharp#396.

I turned on the IsNotConsumingCharacterReferences option, as suggested in that issue, but it caused some strange effects. I went ahead and commented on the original AngleSharp thread to see if they can shed any light on the correct direction.

Answer 2 · 2023-09-15T15:02:29.000Z

@alekdavisintel After more research, I found that AngleSharp tokenizes the HTML into objects that are then output using a formatter. I was able to output the copyright symbol as an HTML entity, but this does not yet cover other HTML entities. I can try expanding this approach to handle all HTML entities.

However, the pitfall (and possibly a feature) is that if the input uses a literal copyright symbol, PreMailer will automatically convert it to an HTML entity. This might be a good feature, since email clients have issues with Unicode. I'm not sure what side effects this might cause, so it might be best to turn it on via a configuration flag.
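To make the trade-off concrete, here is a rough, library-independent sketch of the idea in Python. It is not PreMailer.Net or AngleSharp code; the function name `encode_non_ascii` and the `enabled` flag are hypothetical. It shows an opt-in pass over the already serialized HTML that turns every non-ASCII character into an entity, which is exactly why a literal © in the input would also come out as &copy;.

```python
from html.entities import codepoint2name


def encode_non_ascii(html_text: str, enabled: bool = True) -> str:
    """Replace non-ASCII characters in serialized HTML with entities.

    Hypothetical helper: `enabled` stands in for the configuration flag
    discussed above. ASCII characters, including markup such as < > &,
    are left untouched.
    """
    if not enabled:
        return html_text

    parts = []
    for ch in html_text:
        code_point = ord(ch)
        if code_point < 128:
            parts.append(ch)  # plain ASCII, keep as-is
        elif code_point in codepoint2name:
            # Named entity from the HTML 4 table, e.g. 169 -> "copy"
            parts.append(f"&{codepoint2name[code_point]};")
        else:
            # No named entity known: fall back to a numeric reference
            parts.append(f"&#{code_point};")
    return "".join(parts)


if __name__ == "__main__":
    print(encode_non_ascii("<p>© 2023 Example Corp™</p>"))
```

Because the pass cannot tell whether a character was originally written as an entity or as a literal, it normalizes both, which is the side effect that argues for keeping it behind a flag.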

Answer 3 · 2023-09-15T16:52:29.000Z

Great, thanks. I actually implemented a workaround: after pre-mailing the template, I convert the copyright character back to the HTML entity, so it's not a priority for my use case at this time, but I appreciate the update. I would expand it to other HTML characters, at least to the common ones, like ®, ™, etc. I read the AngleSharp response, but I don't quite get the answer. Anyway, thanks a lot for looking into this. I appreciate it.
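For anyone wanting the same stopgap, a minimal sketch of that post-processing step might look like the following. It is written in Python purely for illustration; the names `COMMON_ENTITIES` and `restore_entities` are made up here, and the same logic ports to a few string replacements on the inlined HTML in any language.

```python
# Hypothetical table of characters to turn back into entities after inlining.
COMMON_ENTITIES = {
    "©": "&copy;",
    "®": "&reg;",
    "™": "&trade;",
}


def restore_entities(inlined_html: str, table: dict = COMMON_ENTITIES) -> str:
    """Replace each listed character in the inlined HTML with its entity."""
    # str.maketrans accepts a dict of single characters -> replacement strings.
    return inlined_html.translate(str.maketrans(table))


if __name__ == "__main__":
    inlined = "<footer>© 2023 Example Corp®</footer>"
    print(restore_entities(inlined))
```

This only touches the characters listed in the table, so any other Unicode text in the template passes through unchanged.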