REGEXES["divToPElementsRe"] logical error
luoqishuai opened this issue · 3 comments
luoqishuai commented
In readability's transform_misused_divs_into_paragraphs:
for elem in self.tags(self.html, "div"):
    if not REGEXES["divToPElementsRe"].search(str_(b"".join(map(tostring_, list(elem))))):
Because the serialized elem always contains "<div", the search always matches, so the "if not" branch is never taken.
demo
from readability.readability import *
import re

doc = Document('<div></div>')
print(tostring_(doc._html()))
node_list = [node for node in doc.tags(doc.html, 'div')]
search_str = ''.join(map(lambda x: tostring_(x).decode(), node_list))
re.search('<(a|blockquote|dl|div|img|ol|p|pre|table|ul)', search_str)
output
b'<html><body><div/></body></html>'
<_sre.SRE_Match object; span=(0, 4), match='<div'>
Please let me know if I've got this wrong.
buriy commented
tostring_ gets HTML and text inside the elements.
luoqishuai commented
compat/__init__.py:
from lxml.etree import tostring

def tostring_(s):
    return tostring(s, encoding='utf-8')
I ran
tostring(node_list[0])
output
b'<div>a</div>'
It looks like tostring(node) also includes the node's own tag.
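A small sketch of the difference under discussion: serializing an element itself versus serializing only its children. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml (an assumption, made so the snippet runs without extra packages). The helper names serialize_self and serialize_children are hypothetical, and the pattern is the one from the demo above.

```python
import re
import xml.etree.ElementTree as ET

# Pattern copied from the demo; not necessarily identical to REGEXES["divToPElementsRe"].
BLOCK_RE = re.compile(r'<(a|blockquote|dl|div|img|ol|p|pre|table|ul)')


def serialize_self(elem):
    """Serialize the element including its own opening and closing tag."""
    return ET.tostring(elem, encoding='unicode')


def serialize_children(elem):
    """Serialize only the child elements; the element's own tag is left out."""
    return ''.join(ET.tostring(child, encoding='unicode') for child in elem)


text_only = ET.fromstring('<div>just text</div>')
with_child = ET.fromstring('<div>text <p>para</p></div>')

# The element's own "<div" makes the pattern match even for a text-only div.
print(bool(BLOCK_RE.search(serialize_self(text_only))))
# With children only, a text-only div yields an empty string and no match.
print(bool(BLOCK_RE.search(serialize_children(text_only))))
# A div that really contains a block-level child still matches.
print(bool(BLOCK_RE.search(serialize_children(with_child))))
```

Which of the two serializations a check sees decides whether a text-only div can ever pass the "if not ... search" test.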
buriy commented
OK, thanks, I'll fix that. This was supposed to replace a div with a p if it contains only text and no tags inside.
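The intended behaviour described here could look roughly like the following. This is only a sketch under assumptions, not the fix that went into readability: it uses xml.etree.ElementTree instead of lxml, divs_to_paragraphs is a hypothetical name, and the pattern is copied from the demo earlier in the thread.

```python
import re
import xml.etree.ElementTree as ET

# Pattern copied from the demo; assumed to approximate divToPElementsRe.
BLOCK_RE = re.compile(r'<(a|blockquote|dl|div|img|ol|p|pre|table|ul)')


def divs_to_paragraphs(root):
    """Rename each <div> whose children contain no block-level tag to <p>.

    Returns the number of elements that were renamed.
    """
    changed = 0
    # Materialize the list first so renaming tags cannot disturb the iteration.
    for elem in list(root.iter('div')):
        # Serialize the children only, so the div's own tag cannot match.
        inner = ''.join(ET.tostring(child, encoding='unicode') for child in elem)
        if not BLOCK_RE.search(inner):
            elem.tag = 'p'
            changed += 1
    return changed


root = ET.fromstring(
    '<body><div>only text</div><div><p>nested</p></div></body>'
)
count = divs_to_paragraphs(root)
result = ET.tostring(root, encoding='unicode')
print(count, result)
```

Only the first div is renamed here; the second keeps its tag because one of its children is a p element.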