buriy/python-readability

Splitting the text in scoring

haziyevv opened this issue · 4 comments

In the score_paragraphs method, the content score is calculated like this:
content_score += len(inner_text.split(','))

But I think it should be like the line below, because a text may contain no commas at all.
content_score += len(re.split(' |,',inner_text))

I also think this could be added: ignore non-words and words shorter than 3 characters.
inner_text = " ".join(re.findall("[^\d\W]{3,}", inner_text))

buriy commented

This is a typical counter-intuitive situation where a "more is better" strategy doesn't work. More separators isn't better, because the heuristic was designed with a different goal in mind.
"|" is rarely used in body text but often in titles, so scoring it would have a negative impact.
"," is rarely used in titles and often in longer texts, which is why it is counted.
Counting spaces would require rescaling the score, and spaces don't distinguish good content from bad content.
The last suggestion looks partially valid: symbols don't make the text better, but punctuation is a sign of real text, so what's the purpose of ignoring it?
Have you evaluated the impact of your changes in practice?

Thank you for replying. Yes, I applied the changes and they were effective. Before, I was unable to get the content of a page, only the footer; after those changes I got the content. Maybe it is because of the input I used: news pages. On many of these pages the main text contains few or no commas, just a large block of prose, while the footer contains lots of commas. For example:

this department is situated in Baku, Azerbaijan, 21thditsti, postcode xx

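To make the counter-example concrete, here is a rough comparison; the body sentence is an invented placeholder for a typical comma-free news sentence:

```python
import re

footer = "this department is situated in Baku, Azerbaijan, 21thditsti, postcode xx"
body = "The ministry announced new measures on Monday following talks with regional officials"

# Current heuristic: the comma-heavy footer outscores the comma-free body.
print(len(footer.split(",")), len(body.split(",")))                 # 4 1

# Splitting on spaces as well puts the two on a comparable footing.
print(len(re.split(r" |,", footer)), len(re.split(r" |,", body)))   # 13 12
```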

buriy commented

Thanks for a valid counter-example. This package is designed for news pages, but it was modeled on English ones and doesn't cover this use case. I would rather suggest discounting the comma count, and I will consider implementing that in the next update; I try to release package updates at least once every 3 months.
This package is made to collect from hundreds or thousands of news sources, and it can behave badly on some specific ones. For quick tuning, positive/negative keywords should work better than other solutions.
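
As a sketch of that quick-tuning route, assuming the positive_keywords / negative_keywords arguments of readability's Document class (worth double-checking against the installed readability-lxml version); the file name is a placeholder:

```python
from readability import Document

# html would normally come from an HTTP fetch; a local file is used here
# only to keep the example self-contained.
with open("news_page.html", encoding="utf-8") as f:
    html = f.read()

doc = Document(
    html,
    positive_keywords=["article", "content", "story"],   # boost nodes with matching class/id
    negative_keywords=["footer", "comment", "sidebar"],  # penalize nodes with matching class/id
)
print(doc.short_title())
print(doc.summary())
```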

buriy commented

@faridhaziyev please don't close this issue.
Once I have time for maintenance, I'll add this improvement.