getnikola/nikola

Pages with Emoji fail to render on macOS (due to lxml bug)

LudovicRousseau opened this issue · 4 comments

Environment

Python Version:
Python 3.11.3 (main, Apr 7 2023, 19:29:16) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Installed from Homebrew

Nikola Version:
Nikola 8.2.4

Operating System:
macOS Monterey 12.6.5

Description:

If I use a Unicode character such as the smiley 😺 in a .rst page, the generated HTML is garbled.
Source unicode.rst page:

.. title: Unicode
.. slug: unicode
.. date: 2023-05-08 18:48:37 UTC+02:00
.. tags: 
.. category: 
.. link: 
.. description: 
.. type: text

😺

The generated html page contains:

[...]
</header><div class="e-content entry-content" itemprop="articleBody text">
    <p>h   t   m   l   &gt;   </p>
    </div>
[...]

And in the browser I see: "h t m l > " for the content of the post.

I have no problem with other Unicode characters, such as the accented letter "è".
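
The contrast between "è" and 😺 suggests a way to find affected pages: "è" (U+00E8) lies inside the Basic Multilingual Plane, while 😺 (U+1F63A) lies above U+FFFF. Below is a minimal sketch, assuming the failure is triggered by any code point above U+FFFF. The report only tests these two characters, so that is a guess, and `find_astral_chars` is a hypothetical helper name, not part of Nikola.

```python
def find_astral_chars(text):
    """Return (index, character, code point label) for each character above U+FFFF."""
    return [
        (i, ch, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ord(ch) > 0xFFFF  # assumption: only these characters trigger the problem
    ]


if __name__ == "__main__":
    # An accented letter stays inside the Basic Multilingual Plane.
    print(find_astral_chars("\u00e8"))
    # The cat emoji from the report lies above U+FFFF.
    print(find_astral_chars("caf\u00e8 \U0001f63a"))
```

Running this over the text of each post would list the characters worth checking on macOS.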

Debian is OK

I then tried the same steps on Debian GNU/Linux 12 (the next Debian stable release) and had no problem.
On Debian I use:

  • Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0] on linux
  • Nikola 8.2.4

In both cases I use a venv.

Debug

I tried to debug but I am new to Nikola.
I tried nikola rst2html.

On macOS I get:

$ nikola rst2html posts/2023/05/unicode.rst 

<!DOCTYPE html>
<html><body><p>!   D   O   C   T   Y   P   E       h   t   m   l   &gt;   
   </p></body></html>

Here again, the result is correct when run on Debian.

Maybe the problem is in a dependency used by Nikola.

This is most likely a bug with lxml, please report it to the lxml project.

If I use the program ./nikola/bin/rst2html.py (on macOS) to convert my .rst post I have no problem.
I get:

[...]
<div class="document">


<!-- title: Unicode -->
<!-- slug: unicode -->
<!-- date: 2023-05-08 18:48:37 UTC+02:00 -->
<!-- tags: -->
<!-- category: -->
<!-- link: -->
<!-- description: -->
<!-- type: text -->
<p>😺</p>
</div>
</body>
</html>

I have no idea how lxml is used by Nikola.
Can you provide sample code using lxml that fails, so I can report the issue to the lxml project?

Sure, here’s some sample code:

import lxml.html
html = """<!DOCTYPE html>
<head><meta charset="utf-8"></head>
<body>
<h1>Hello, world!</h1>
<div>
<p>\U0001f63a</p>
</div>
</body>
</html>"""

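# Parse the str input, then serialize the tree back to UTF-8 bytes.
parser = lxml.html.HTMLParser(remove_blank_text=True)
doc = lxml.html.document_fromstring(html, parser)
data = lxml.html.tostring(doc, encoding='utf8', method='html', pretty_print=True, doctype='<!DOCTYPE html>')
# A correct result still contains the <h1> and the emoji as UTF-8 bytes.
print(data)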

Can you reproduce the issue using this code on macOS? For reference, I get the following output on Windows and Linux:

b'<!DOCTYPE html>\n<html>\n<head><meta charset="utf-8"></head>\n<body>\n<h1>Hello, world!</h1>\n<div>\n<p>\xf0\x9f\x98\xba</p>\n</div>\n</body>\n</html>\n'

Bingo!
On macOS I get:

>>> print(data)
b'<!DOCTYPE html>\n<html><body><p>!   D   O   C   T   Y   P   E       h   t   m   l   &gt;   \n   </p></body></html>\n'

I reported the lxml issue at https://bugs.launchpad.net/lxml/+bug/2019038
Thanks
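
The garbled output has a recognizable shape: every character is followed by three blanks. One possible explanation, not confirmed anywhere in this thread, is that the text was handed over in a 4-byte-per-character encoding (such as UTF-32LE) and then read one byte at a time, with the NUL bytes ending up as blanks. The sketch below only reproduces that pattern with the standard library. It doesn't call lxml, and `misread_utf32` is a hypothetical name.

```python
def misread_utf32(text):
    """Encode text as UTF-32LE, then misread it byte by byte, showing NULs as spaces."""
    raw = text.encode("utf-32-le")  # 4 bytes per character, no byte order mark
    # latin-1 maps every byte to one character, so nothing is lost or rejected.
    return raw.decode("latin-1").replace("\x00", " ")


if __name__ == "__main__":
    # repr() makes the runs of blanks visible.
    print(repr(misread_utf32("html>")))
    print(repr(misread_utf32("!DOCTYPE html>\n")))
```

With "html>" as input, this gives the same "h   t   m   l   >   " text seen in the generated page. That's consistent with the guess, but it doesn't prove it.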