RSS subscription: text in <p> tags is not displayed on its own line
ifwlzs opened this issue · 2 comments
ifwlzs commented
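Environment
- nonebot-bison version: 0.9.2
- nonebot version: 2.2.1
- Installation method: 1 (one of the following, or another method)
- Installed via nb-cli
- Installed with a modern package manager such as poetry/pdm
- Installed via pip install
- Cloned or downloaded the project and used it directly
- Operating system: windows 2009 (19045.4710)
Problem
In RSS subscriptions, the text inside <p> tags is not displayed on its own line
Logs
Paste your logs here
- [ √ ] I searched the issues and did not find a problem similar to mine
- [ √ ] I confirm that sensitive information has been removed from the logs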
suyiiyii commented
The cause is at line 68 below: when the bs library extracts the text of the HTML, formatting information such as <p> tags is lost
nonebot-bison/nonebot_bison/platform/rss.py
Lines 65 to 69 in 1c753f7
In [23]: doc = """
...: terterthv<p>cxiobjhoijeraoi</p>jgiojoidfgjk<p>ldfjgioj</p>bvcxninclin
...: """
In [24]: soup = bs(doc,"html.parser")
In [25]: soup.get_text()
Out[25]: '\nterterthvcxiobjhoijeraoijgiojoidfgjkldfjgiojbvcxninclin\n'
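How bs decides where to break lines when extracting text
It appears to follow the line breaks present in the HTML source itself, not the tags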
From https://www.crummy.com/software/BeautifulSoup/bs4/doc/
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
print(soup.get_text())
# The Dormouse's story
#
# The Dormouse's story
#
# Once upon a time there were three little sisters; and their names were
# Elsie,
# Lacie and
# Tillie;
# and they lived at the bottom of a well.
#
# ...
In [14]: html_doc = """<html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p><
...: p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" cl
...: ass="sister" id="link1">Elsie</a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http:/
...: /example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well.</p><p class="story">...</
...: p>"""
In [15]: soup = BeautifulSoup(html_doc, 'html.parser');print(soup.get_text())
The Dormouse's storyThe Dormouse's storyOnce upon a time there were three little sisters; and their names wereElsie,Lacie andTillie;and they lived at the bottom of a well....
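The two runs above suggest that `get_text()` simply concatenates the document's text nodes, so a newline only shows up when the source HTML contains one between tags. A minimal sketch to check this, assuming `bs4` is installed; the helper name `text_nodes` is made up for illustration:

```python
from bs4 import BeautifulSoup


def text_nodes(html: str) -> list[str]:
    """Return every text node in document order, as plain str objects."""
    soup = BeautifulSoup(html, "html.parser")
    return [str(s) for s in soup.strings]


# The newline between the two <p> elements is itself a text node,
# which is why it survives in get_text().
print(text_nodes("<p>a</p>\n<p>b</p>"))  # ['a', '\n', 'b']

# Without whitespace in the source there is nothing to separate the paragraphs.
print(text_nodes("<p>a</p><p>b</p>"))    # ['a', 'b']
```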
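Manually preprocess the HTML
After fetching the description, preprocess it by hand first: for example, replace <p> with <br>, then replace <br> with \n, and only then hand the processed HTML to bs, which yields text that keeps its line structure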
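The preprocessing idea above could be sketched roughly as follows. This is only an illustration, not the fix adopted in nonebot-bison: the function name `html_to_text` is hypothetical, treating closing `</p>` tags as breaks and collapsing consecutive blank lines are my own assumptions, and `bs4` is assumed to be installed.

```python
import re

from bs4 import BeautifulSoup


def html_to_text(html: str) -> str:
    """Turn <p> and <br> boundaries into newlines, then strip remaining tags."""
    # Step 1: replace opening and closing <p> tags with <br>.
    # The \b keeps tags such as <pre> from matching.
    html = re.sub(r"</?p\b[^>]*>", "<br>", html, flags=re.IGNORECASE)
    # Step 2: replace every <br>, <br/> or <br /> with a real newline.
    html = re.sub(r"<br\s*/?>", "\n", html, flags=re.IGNORECASE)
    # Step 3: let bs drop the remaining tags; the newlines are now text nodes.
    text = BeautifulSoup(html, "html.parser").get_text()
    # Assumption: runs of blank lines are collapsed into a single line break.
    text = re.sub(r"\n{2,}", "\n", text)
    return text.strip()


doc = "terterthv<p>cxiobjhoijeraoi</p>jgiojoidfgjk<p>ldfjgioj</p>bvcxninclin"
print(html_to_text(doc))
```

With the sample from earlier in this thread, each of the five fragments ends up on its own line. Regex-based tag replacement is fragile on malformed markup, so this is best seen as a quick workaround.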
html2text
This library converts HTML to Markdown
In [27]: html_doc = """<html><head><title>The Dormouse's story</title></head><body><p class="title"><b>The Dormouse's story</b></p><
...: p class="story">Once upon a time there were three little sisters; and their names were<a href="http://example.com/elsie" cl
...: ass="sister" id="link1">Elsie</a>,<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and<a href="http:/
...: /example.com/tillie" class="sister" id="link3">Tillie</a>;and they lived at the bottom of a well.</p><p class="story">...</
...: p>"""
In [28]: h = html2text.HTML2Text()
In [29]: h.ignore_links = True
In [30]: print(h.handle(html_doc))
**The Dormouse's story**
Once upon a time there were three little sisters; and their names
wereElsie,Lacie andTillie;and they lived at the bottom of a well.
...
In [31]: html_doc = """<html><body><p>cxiobjhoijeraoi</p>jgiojoidfgjk<p>ldfjgioj</p>bvcxninclin</body></html>"""
In [32]: h = html2text.HTML2Text()
In [33]: h.ignore_links = True
In [34]: print(h.handle(html_doc))
cxiobjhoijeraoi
jgiojoidfgjk
ldfjgioj
bvcxninclin
After this processing we get reasonably clean plain text
felinae98 commented
I recall that weibo, or somewhere else, also has similar (hand-rolled) code for processing HTML text. Should we unify the handling?