Pages with Emoji fail to render on macOS (due to lxml bug)
LudovicRousseau opened this issue · 4 comments
Environment
Python Version:
Python 3.11.3 (main, Apr 7 2023, 19:29:16) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
Installed from Homebrew
Nikola Version:
Nikola 8.2.4
Operating System:
macOS Monterey 12.6.5
Description:
If I use a Unicode character such as an emoji (for example 😺) in a .rst page, the generated HTML is broken.
Source page unicode.rst:
.. title: Unicode
.. slug: unicode
.. date: 2023-05-08 18:48:37 UTC+02:00
.. tags:
.. category:
.. link:
.. description:
.. type: text
😺
The generated html page contains:
[...]
</header><div class="e-content entry-content" itemprop="articleBody text">
<p>h t m l > </p>
</div>
[...]
And in the browser I see: "h t m l > " for the content of the post.
Other Unicode characters, such as the accented letter "è", cause no problem.
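One difference between the two characters that may be relevant (an assumption, not a confirmed diagnosis) is that "è" lies inside the Basic Multilingual Plane while 😺 lies outside it. A minimal sketch using only the standard library; the helper name `describe` is made up for illustration:

```python
def describe(ch: str) -> dict:
    """Return basic encoding facts about a single character."""
    code_point = ord(ch)
    return {
        "code_point": f"U+{code_point:04X}",
        # Code points above U+FFFF are outside the Basic Multilingual Plane.
        "outside_bmp": code_point > 0xFFFF,
        "utf8_bytes": len(ch.encode("utf-8")),
        # UTF-16 needs a surrogate pair (two units) outside the BMP.
        "utf16_units": len(ch.encode("utf-16-le")) // 2,
    }


# "\u00e8" is the accented letter that works, "\U0001f63a" the emoji that fails.
for ch in ("\u00e8", "\U0001f63a"):
    print(repr(ch), describe(ch))
```

If other characters above U+FFFF fail in the same way, that would point at how wide characters are handled rather than at this particular emoji.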
Debian is OK
I then tried the same steps on Debian GNU/Linux 12 (the next Debian stable) and had no problem.
On Debian I use:
- Python 3.11.2 (main, Mar 13 2023, 12:18:29) [GCC 12.2.0] on linux
- Nikola 8.2.4
In both cases I use a venv.
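Since the two machines behave differently, it may help to compare the package versions inside each venv. A small sketch using only the standard library; the function names and the list of distributions are illustrative choices, and docutils is included only on the assumption that it is what renders the .rst source:

```python
import platform
import sys
from importlib import metadata


def installed_version(dist_name: str):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def environment_report(dist_names=("Nikola", "lxml", "docutils")) -> dict:
    """Collect interpreter, platform, and package versions for a bug report."""
    report = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for name in dist_names:
        report[name] = installed_version(name)
    return report


for key, value in environment_report().items():
    print(f"{key}: {value}")
```

Running this in both venvs and diffing the output shows at a glance whether the two setups differ in anything other than the operating system.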
Debug
I tried to debug but I am new to Nikola.
I tried nikola rst2html.
On macOS I get:
$ nikola rst2html posts/2023/05/unicode.rst
<!DOCTYPE html>
<html><body><p>! D O C T Y P E h t m l >
</p></body></html>
Here again the result is correct if run on Debian.
Maybe the problem is in a dependency used by Nikola.
This is most likely a bug in lxml. Please report it to the lxml project.
If I use the program ./nikola/bin/rst2html.py (on macOS) to convert my .rst post, I have no problem.
I get:
[...]
<div class="document">
<!-- title: Unicode -->
<!-- slug: unicode -->
<!-- date: 2023-05-08 18:48:37 UTC+02:00 -->
<!-- tags: -->
<!-- category: -->
<!-- link: -->
<!-- description: -->
<!-- type: text -->
<p>😺</p>
</div>
</body>
</html>
I have no idea how lxml is used by Nikola.
Can you provide sample code using lxml that should fail, so I can report the issue to lxml?
Sure, here’s some sample code:
import lxml.html
html = """<!DOCTYPE html>
<head><meta charset="utf-8"></head>
<body>
<h1>Hello, world!</h1>
<div>
<p>\U0001f63a</p>
</div>
</body>
</html>"""
# Parse the str document, dropping whitespace-only text nodes
parser = lxml.html.HTMLParser(remove_blank_text=True)
doc = lxml.html.document_fromstring(html, parser)
# Serialize back to UTF-8 encoded bytes
data = lxml.html.tostring(
    doc, encoding='utf8', method='html', pretty_print=True,
    doctype='<!DOCTYPE html>')
print(data)
Can you reproduce the issue using this code on macOS? For reference, I get the following output on Windows and Linux:
b'<!DOCTYPE html>\n<html>\n<head><meta charset="utf-8"></head>\n<body>\n<h1>Hello, world!</h1>\n<div>\n<p>\xf0\x9f\x98\xba</p>\n</div>\n</body>\n</html>\n'
Bingo!
On macOS I get:
>>> print(data)
b'<!DOCTYPE html>\n<html><body><p>! D O C T Y P E h t m l > \n </p></body></html>\n'
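To turn the comparison above into a pass/fail check that could go into the lxml report, something like the following might work. It is only a sketch: `emoji_survives` and `roundtrip` are hypothetical helper names, `roundtrip` repeats the lxml calls from the sample code, and the two byte strings are the outputs quoted in this thread.

```python
EMOJI = "\U0001f63a"  # the cat emoji from the sample page


def emoji_survives(serialized: bytes, emoji: str = EMOJI) -> bool:
    """Return True if the UTF-8 bytes of `emoji` appear in the serialized HTML."""
    return emoji.encode("utf-8") in serialized


def roundtrip(html: str) -> bytes:
    """Parse and re-serialize `html` with the same lxml calls as the sample code."""
    import lxml.html  # imported lazily so emoji_survives() works without lxml

    parser = lxml.html.HTMLParser(remove_blank_text=True)
    doc = lxml.html.document_fromstring(html, parser)
    return lxml.html.tostring(doc, encoding="utf8", method="html",
                              pretty_print=True, doctype="<!DOCTYPE html>")


# Output reported for Windows and Linux in this thread
linux_output = (
    b'<!DOCTYPE html>\n<html>\n<head><meta charset="utf-8"></head>\n<body>\n'
    b'<h1>Hello, world!</h1>\n<div>\n<p>\xf0\x9f\x98\xba</p>\n</div>\n'
    b'</body>\n</html>\n'
)
# Output reported for macOS in this thread
macos_output = (
    b'<!DOCTYPE html>\n<html><body><p>! D O C T Y P E h t m l > \n </p>'
    b'</body></html>\n'
)

print(emoji_survives(linux_output))
print(emoji_survives(macos_output))
```

On a machine with lxml installed, `emoji_survives(roundtrip(html))` with the `html` string from the sample code gives a single True/False answer for that platform.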
I reported the lxml issue at https://bugs.launchpad.net/lxml/+bug/2019038
Thanks