isProbablyReaderable

Question

Opened this issue 3 years ago · 3 comments

How difficult would it be to implement isProbablyReaderable(doc, options) (from https://github.com/mozilla/readability#isprobablyreaderabledocument-options)?

This would make it possible to check whether a webpage is actually interesting or relevant for scraping, and save processing time.

Would this be hard to implement? I could also try working on it.

Answer 1 · 2022-05-27T05:20:12.000Z

It's not difficult to implement it that way, but I'm afraid you won't get any big improvement in parsing time (typical article processing time is currently 0.1-0.4 s per page), nor would it be reliable. To be more precise:

If you use the minScore check, the readability algorithm is exactly the same but without the cleaning phase, so it will take almost the same time.
If you only check the HTML, the result is completely unreliable.

Answer 2 · 2022-05-28T15:29:21.000Z

Oh I see. What could I do to use readability to check whether a webpage actually has interesting content?

For example, an actual article should pass this check and something like the Google homepage shouldn't.

Answer 3 · 2022-05-29T05:05:53.000Z

The main check should be whether there's something to read: text at least 300 chars long. Ideally, 500+ chars.
You can check this after processing with readability: just convert the result to text and check the length.
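
A minimal sketch of that length check, in case it helps. It assumes you already have the HTML string that readability returned for the page; it does not call readability itself. The names `html_to_text` and `has_readable_content` are hypothetical helpers, not part of any library, and the text conversion uses only Python's standard `html.parser`.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse whitespace so markup indentation doesn't inflate the count.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Hypothetical helper: strip tags and return whitespace-normalized text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def has_readable_content(extracted_html, min_chars=300):
    """Hypothetical helper: True if the extracted content has enough text.

    min_chars defaults to the 300-char lower bound suggested above;
    pass 500 for the stricter check.
    """
    return len(html_to_text(extracted_html)) >= min_chars
```

With this, something like `has_readable_content(article_html)` would pass for a real article, while a page that is mostly a search box and a few links would fail. The threshold is a judgment call, so it's worth tuning on pages you actually scrape.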