buriy/python-readability

REGEXES["divToPElementsRe"] logical error

luoqishuai opened this issue · 3 comments

In readability's transform_misused_divs_into_paragraphs:

for elem in self.tags(self.html, "div"):
    if not REGEXES["divToPElementsRe"].search(str_(b"".join(map(tostring_, list(elem))))):

Because elem is always a "div", the search seems to always match "<div", so the "if not" check never takes effect.

demo

import re
from readability.readability import *

doc = Document('<div></div>')
print(tostring_(doc._html()))
# Collect every div in the parsed document.
node_list = [node for node in doc.tags(doc.html, 'div')]
# Serialize the divs themselves (each string includes its own <div> tag).
search_str = ''.join(map(lambda x: tostring_(x).decode(), node_list))
print(re.search('<(a|blockquote|dl|div|img|ol|p|pre|table|ul)', search_str))

output

b'<html><body><div/></body></html>'
<_sre.SRE_Match object; span=(0, 4), match='<div'>

Please let me know if I got this wrong.

buriy commented

tostring_ gets HTML and text inside the elements.

luoqishuai commented

compat/__init__.py:

from lxml.etree import tostring
def tostring_(s):
    return tostring(s, encoding='utf-8')

I ran:
tostring(node_list[0])

output
b'<div>a</div>'

It looks like tostring(node) also includes the node's own tag.
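
To make the difference concrete, here is a minimal sketch comparing the two ways of serializing a div. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml.etree (an assumption, chosen so the snippet runs without extra packages), and BLOCK_RE is just a local name for the pattern from the demo above.

```python
import re
import xml.etree.ElementTree as ET  # stand-in for lxml.etree (assumption)

# Same pattern as in the demo above; BLOCK_RE is a local, illustrative name.
BLOCK_RE = re.compile(r"<(a|blockquote|dl|div|img|ol|p|pre|table|ul)", re.I)

div = ET.fromstring("<div>a</div>")

# Serializing the element itself includes its own <div> tag.
own = ET.tostring(div, encoding="unicode")

# Serializing only the children (iterating over div) leaves the wrapper out.
children = "".join(ET.tostring(child, encoding="unicode") for child in div)

print(repr(own), bool(BLOCK_RE.search(own)))            # '<div>a</div>' True
print(repr(children), bool(BLOCK_RE.search(children)))  # '' False
```

So whether the regex matches depends on what gets serialized: the node itself always matches on its own tag, while the joined children match only if one of them is a listed tag.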

buriy commented

OK, thanks, I'll fix that. This was supposed to replace a div with a p if it contains only text and no tags inside.
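
For illustration only, the intended behavior described here could be sketched as below. This is not the library's actual fix: divs_to_paragraphs is a hypothetical helper, and xml.etree.ElementTree again stands in for lxml.etree (an assumption). It treats "no tags inside" as "no child elements at all".

```python
import xml.etree.ElementTree as ET  # stand-in for lxml.etree (assumption)


def divs_to_paragraphs(root):
    """Rename every <div> without child elements to <p> (hypothetical helper)."""
    # Materialize the list first so renaming cannot disturb the iteration.
    for elem in list(root.iter("div")):
        # len(elem) counts child elements only; text content is not counted.
        if len(elem) == 0:
            elem.tag = "p"
    return root


root = ET.fromstring(
    "<body><div>only text</div><div><span>x</span> mixed</div></body>"
)
divs_to_paragraphs(root)
result = ET.tostring(root, encoding="unicode")
print(result)
# <body><p>only text</p><div><span>x</span> mixed</div></body>
```

Checking len(elem) avoids serializing anything, so the element's own tag cannot trigger a false match.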